Abstract


As robots and other intelligent agents move from simple environments 
and problems to more complex, unstructured settings, manually pro- 
gramming their behavior has become increasingly challenging and ex- 
pensive. Often, it is easier for a teacher to demonstrate a desired be- 
havior rather than attempt to manually engineer it. This process of 
learning from demonstrations, and the study of algorithms to do so, is 
called imitation learning. This work provides an introduction to imi- 
tation learning. It covers the underlying assumptions, approaches, and 
how they relate; the rich set of algorithms developed to tackle the prob- 
lem; and advice on effective tools and implementation. 

We intend this paper to serve two audiences. First, we want to famil- 
iarize machine learning experts with the challenges of imitation learn- 
ing, particularly those arising in robotics, and the interesting theoreti- 
cal and practical distinctions between it and more familiar frameworks 
like statistical supervised learning theory and reinforcement learning. 
Second, we want to give roboticists and experts in applied artificial in- 
telligence a broader appreciation for the frameworks and tools available 
for imitation learning. 

We organize our work by dividing imitation learning into directly 
replicating desired behavior (sometimes called behavioral cloning [Bain 
and Sammut, 1996]) and learning the hidden objectives of the desired 
behavior from demonstrations (called inverse optimal control [Kalman, 
1964] or inverse reinforcement learning [Russell, 1998]). In addition to 
method analysis, we discuss the design decisions a practitioner must 
make when selecting an imitation learning approach. Moreover,
application examples, such as robots that play table tennis [Kober and
Peters, 2009] and programs that play the game of Go [Silver et al.,
2016], illustrate the properties and motivations behind different forms
of imitation learning. We conclude by presenting a set of open questions
and pointing towards possible future research directions.

Introduction


Programming autonomous behavior in machines and robots tradition- 
ally requires a specific set of skills and knowledge. However, human 
experts know how to demonstrate the desired task even if they do not 
know how to program the necessary behavior in a machine or robot. 
The purpose of imitation learning is to efficiently learn a desired be- 
havior by imitating an expert’s behavior. The application of imitation 
learning is not limited to physical systems. It can be a powerful tool 
to design autonomous behavior in systems such as web sites, computer 
games, and mobile applications. Any system that requires autonomous 
behavior similar to human experts can benefit from imitation learning. 

However, imitation learning may be essential for robotics, which is now
considered to be a key technology for applications such as
manufacturing, elder care, and the service industry. Robots in these
applications will be expected to work closely with humans, a dramatic
shift from prior uses of robots. Powerful robotic manipulators are
dangerous and have therefore been used mainly in constrained, predefined
industrial applications; employees must undergo special training before
working with them. This is changing due to recent advances in robotics,
from compute to the use of light, compliant, and safe robotic
manipulators. Such manipulators are ideal for applications where robots
work alongside people, such as collaborating with human operators and
reducing the physical workload of caregivers. These applications require
efficient, intuitive ways for domain experts, who may not possess
special skills or knowledge about robotics, to teach robots the motions
they need to perform.

In recent years, imitation learning has been investigated as a way to 
efficiently and intuitively program autonomous behavior [Schaal, 1999,
Argall et al., 2009, Billard et al., 2008, Billard and Grollman, 2013, 
Bagnell, 2015, Billard et al., 2016]. In imitation learning, a human 
demonstrates how to perform a task. A robotic system learns a pol- 
icy to execute the given task by imitating the demonstrated motions. 
Numerous imitation learning methods have been developed, and imitation
learning has become a very large field of research. As a consequence,
capturing the entire field of imitation learning is not a trivial task.

The purpose of this survey is to provide a structural understanding 
of existing imitation learning methods and their relationship with other
fields from supervised learning to control theory. We will describe what 
has been developed in the field of imitation learning and what should 
be developed in the future. 


1.1 Key Successes in Imitation Learning


One of the earliest and most well-known imitation learning success sto- 
ries was the autonomous driving project Autonomous Land Vehicle In 
a Neural Network (ALVINN) at Carnegie Mellon University [Pomer- 
leau, 1988]. In ALVINN, a neural network learned how to map input 
images to discrete actions in order to drive a vehicle. ALVINN’s neu- 
ral network had one hidden layer with five units. Its input layer had 
30 by 32 units; its output layer had 30 units. Although the structure 
of this network was simple compared to modern neural networks with 
millions of parameters, the system succeeded at driving autonomously 
across the North American continent. 

The Kendama robot developed by Miyamoto et al. [1996] is another
successful application of imitation learning. In the early days of
imitation learning, roboticists were mainly interested in teaching
higher-level tasks from human demonstrations, such as “pick,” “move,”
and “place” [Kang and Ikeuchi, 1993, Kuniyoshi et al., 1994]. In those
settings, lower-level tasks were often considered to be simple,
point-to-point motions. In the late 1990s, this focus shifted from
task-level planning to trajectory-level planning. The term “learning
from demonstration” has become very popular since its use by S. Schaal
and C. Atkeson [Schaal, 1997, Atkeson and Schaal, 1997]. Since then,
learning robot motions has been a key domain of imitation learning.

Recently, learning from human demonstrations has benefited from
developments in deep neural networks. Recurrent neural networks such as
long short-term memory (LSTM) networks [Hochreiter and Schmidhuber,
1997] have played a significant role in demonstrating how to succeed in
many previously difficult sequential tasks by learning from demonstrated
data. These tasks include generating handwriting [Chung et al., 2015],
natural language [Wen et al., 2015], and image captions [Karpathy and
Fei-Fei, 2015]. Furthermore, AlphaGo, the algorithm that was able to
beat a human Go master and that we discuss in more detail in §3.4.2,
initializes a deep neural network policy from human demonstrations
[Silver et al., 2016]. These recent approaches often require a large
amount of data. In §3, we will discuss how to learn from data to
reproduce observed behavior in specific problem settings.


1.2 Imitation Learning from the Point of View of Robotics


Imitation learning is a class of methods that reproduces desired
behavior based on expert demonstrations. In many cases, the experts are
human operators and the learners are robotic systems. Thus, imitation
learning is a technique that enables skills to be transferred from
humans to robotic systems. To perform imitation learning, we need to
develop a system that records demonstrations by experts and learns a
policy to reproduce the demonstrated behavior from the recorded data.
For this purpose, we need to answer the following questions.

General Aspects:


1. Why and when should imitation learning be used? This 
question clarifies the motivation for using imitation learning and 
what we should do with it. 


2. Who should demonstrate? In many cases, the experts are hu- 
man operators. Many imitation learning methods implicitly as- 
sume that demonstrations are provided by a single expert. When 
multiple experts are available, we need to decide which one should 
be imitated or how we can incorporate demonstrations from mul- 
tiple experts. 


3. How should we record data of the expert demonstra- 
tions? There are multiple ways of recording the behavior of 
experts. For example, motion capture systems and teleoperated 
robotic systems record data from expert behavior. This choice is 
closely related to the embodiment problem between experts and 
learners, which will be discussed in §3.7.1. 


4. What should we imitate? The recorded data often includes 
redundant information about expert behavior. In such cases, fea- 
tures appropriate for the desired behavior should be selected. 
Meanwhile, the recorded data also includes unnecessary motions, 
which should not be imitated. The data must be segmented to 
extract the motions to be imitated. 


Algorithmic Aspects: 


5. How should we represent the policy? Expert behavior can 
be represented using methods such as symbolic representation, 
trajectory-based representation, and state-action space represen- 
tation. The choice depends largely on the design of the entire 
system. 


6. How should we learn the policy? Many algorithms for learn- 
ing the policy have been developed over the past several decades. 
The choice of the algorithm for learning the policy is closely re- 
lated to the choice of policy representation. 
With regard to the first four questions, several survey papers on
imitation learning [Argall et al., 2009, Billard et al., 2008, Billard and 
Grollman, 2013, Billard et al., 2016] provide a taxonomy of imitation
learning from the perspective of robotics. Argall et al. [2009] indicate 
that it is essential to design an imitation learning system by considering 
the correspondence between the expert and the learner, data acquisi- 
tion methods, and limitations of the demonstration dataset. Billard 
et al. [2008, 2016] provide an overview of imitation learning methods 
and highlight techniques for trajectory learning. However, none of the 
previous review articles focused on the design decisions needed to de- 
velop new imitation learning algorithms to enable answering questions 
five and six related to the algorithmic aspects discussed above. In ad- 
dition, these articles did not discuss the algorithmic details of exist- 
ing methods because the enormous amount of prior work on imitation 
learning makes it challenging to cover the entire range of previous stud- 
ies. 

In this survey, we provide an overview of existing methods from 
the algorithmic point of view, which will be useful for both readers 
beginning the practice of imitation learning and readers who want to 
achieve a deeper understanding of the theoretical aspects of imitation 
learning. We discuss the design choices which one should consider in or- 
der to develop novel imitation learning algorithms. Although our survey 
cannot be exhaustive, we discuss the algorithmic details of existing al- 
gorithms as much as possible, which will be useful to readers who want 
to implement imitation learning techniques. Additionally, we develop 
an information theoretic understanding of existing methods, which will 
help readers to understand how existing methods relate to each other 
and figure out how they could be extended. 

Let us illustrate how different design choices of imitation learning
algorithms can be made in different applications. Figure 1.1 shows
three applications of imitation learning: 1) an RC helicopter, 2) robotic
surgery, and 3) quadruped robot locomotion. In these applications, the
design of the policies for motion planning and control varies. Abbeel
et al. [2010] demonstrate acrobatic RC helicopter flight by learning
from trajectories demonstrated by a human expert. In this system, the
desired trajectories of acrobatic flights were learned from
demonstrations with a supervised learning method. Osa et al. [2017b]
also learned trajectories for autonomous knot tying from demonstrations
by a human expert. To generalize a trajectory, Osa et al. [2017b]
learned a direct mapping from task situations (contexts) to trajectories
using demonstrations recorded under various situations. In contrast to
[Abbeel et al., 2010, Osa et al., 2017b], Zucker et al. [2011]
formulated footstep planning for quadruped robot locomotion as an
optimization of the reward/cost function. The reward/cost function was
recovered from demonstrations. In [Zucker et al., 2011], learning the
reward/cost function as a function of terrain features enables the
footstep planning strategy to be generalized to different terrains.
Learning such reward/cost functions for manipulation tasks such as knot
tying [Osa et al., 2017b] is not trivial, since complex manipulation
tasks often require nonlinear reward/cost functions.

[Figure 1.1 (image not reproduced). In each panel, a demonstration by
experts is recorded as $\tau^{\mathrm{demo}} = [y_0, u_0, \ldots, y_T, u_T]$
and collected in the dataset
$\mathcal{D} = \{\tau_i^{\mathrm{demo}}\}_{i=1}^{N}$.]

(a) Learning of acrobatic RC helicopter maneuvers [Abbeel et al., 2010].
The trajectories for acrobatic flights are learned from a human expert’s
demonstrations. To control the system with highly nonlinear dynamics,
iterative learning control was used. Panel labels: control inputs $u_t$
(left-right tilt, forward-backward tilt, vertical rotational rate, rotor
collective pitch); observations $y_t$ (accelerometers, gyro sensors,
magnetometers, GPS, vision system). Image source:
https://commons.wikimedia.org/w/index.php?curid=11467562

(b) Learning with a teleoperated system [Osa et al., 2014] where a
position/velocity controller is available. To generalize the trajectory
to different situations, a mapping from task situations to trajectories
is learned from demonstrations under various situations. Panel labels:
control inputs $u_t$ (position of the master manipulator); observations
$y_t$ (position of the slave manipulator).

(c) Learning quadruped robot locomotion [Zucker et al., 2011]. The
footstep planning was addressed as an optimization of the reward/cost
function, which was recovered from the expert demonstrations. Learning
the reward/cost function allows the footstep planning strategy to be
generalized to different terrains. Panel labels: analog joystick,
terrain features, footstep locations.

Figure 1.1: Observations y and control inputs u for imitation learning in (a)
helicopter flight, (b) surgery, and (c) locomotion. Motion planning is formulated in
different ways in these examples.

Methods for learning policies also differ between applications. The 
observation and control inputs of the RC helicopter system are much 
noisier than those of the other two systems, and its dynamics are highly 
nonlinear [Abbeel et al., 2010]. Therefore, it is essential to estimate the 
true state using various sensory information and learn an adaptive con- 
troller through iterations of trials to achieve acrobatic RC helicopter 
flight. On the other hand, we can assume that the system state is 
precisely known and a position/velocity controller is available in the 
case of the tele-operation system in [Osa et al., 2014], which simplifies 
imitation learning significantly. In [Osa et al., 2014], the conditional 
trajectory distribution given a context can be learned with a simple re- 
gression method, and the planned trajectory can be executed by a stan- 
dard velocity controller. In locomotion planning for a quadruped robot 
in [Zucker et al., 2011], estimating the reward/cost function requires 
an iterative learning process with virtual simulation of the learned pol- 
icy. As one can see from these examples, learning methods can be very 
different between applications. 

To apply imitation learning, it is essential to identify the structure 
of the system, formulate a given problem, and design an algorithm to 
solve the problem efficiently. In this survey, we focus on the algorithmic 
aspects of imitation and discuss necessary design choices, exploring 
various solutions proposed by previous studies.

In the rest of this chapter, we introduce several concepts in machine 
learning that are essential to understand imitation learning algorithms. 
We discuss the design choices of imitation learning algorithms in Chap- 
ter 2. We describe the details of behavioral cloning methods and inverse 
reinforcement learning methods in Chapters 3 and 4, respectively. To 
conclude, we list open questions of imitation learning in Chapter 5. 


1.3 Key Differences between Imitation Learning and Supervised Learning


The imitation learning problem has special properties that distinguish
it from the better-known supervised learning setting [Shalev-Shwartz
and Ben-David, 2014]: 1) the solution may have important structural
properties, including constraints (for example, robot joint limits),
dynamic smoothness and stability, or the need to form a coherent,
multi-step plan [Bagnell, 2015]; 2) the learner’s decisions interact
with its own input distribution (an on-policy versus off-policy
distinction); and 3) there is an increased need to minimize the
typically high cost of gathering examples.

As we learn a policy $\pi$ from a dataset $\mathcal{D}$, imitation
learning is closely related to supervised learning, and is particularly
related to the field of structured prediction [Daumé III et al., 2009,
Ratliff et al., 2006a, Taskar, 2005], where the task is to learn a
mapping from inputs $x$ to a complex, structured output $y$ (plans,
parse trees, complex motions). Reductions of structured prediction to
sequential decision making [Daumé III et al., 2009] and reductions of
imitation learning to structured prediction [Ratliff et al., 2006b] show
the close connection, and cross-fertilization between these research
areas has been important for both. In practice, distinctions arise
because of the structural properties of the policies we attempt to
imitate, and because "resetting" the state and restarting predictions is
too costly or even infeasible in most imitation learning settings, where
a physical system is often involved.

In addition, it is often the case that the embodiments of the expert
and the learner are different. For example, when transferring human
skills to a humanoid robot, the motion captured from a human expert
may be infeasible for the humanoid. In such a case, the demonstrated
motion needs to be adapted to be feasible for the humanoid. This kind
of adaptation is less common in standard supervised learning.

In machine learning, the prediction problem where the source domain
distribution and the target domain distribution are different is often
referred to as “covariate shift” or “domain adaptation” [Sugiyama,
2015]. In imitation learning, the source domain corresponds to expert
demonstrations and the target domain to learner reproductions. In im- 
itation learning, the demonstration dataset does not cover all possible 
situations since collecting expert demonstrations to cover all situations 
is usually too expensive and time-consuming. As a result, the learner 
often encounters states which were not encountered by the expert dur- 
ing demonstrations, which means that the target domain distribution is 
different from the source distribution. Therefore, covariate shift or do- 
main adaptation is closely related to imitation learning [Bagnell, 2015]. 

Imitation learning is also closely related to reinforcement learning
(RL), which tries to obtain a policy that maximizes an expected reward
signal [Sutton and Barto, 1998]. In RL, we employ a reward
function that encourages a desired behavior. However, in imitation 
learning we often assume optimal (or at least “good”) expert demon- 
strations which are not available in basic reinforcement learning, and 
which provide prior knowledge that allows for dramatically more effi- 
cient methods. Recent work by Sun et al. [2017] demonstrates a po- 
tentially exponential decrease in sample complexity in learning a task 
by imitation rather than by trial-and-error reinforcement learning, and 
empirical results have long shown such benefits [Silver et al., 2016, 
Kober and Peters, 2009, Abbeel et al., 2010]. Moreover, in the imi- 
tation learning setting, as we detail below, we may or may not have 
access to a true reward function. 


1.4 Insights for Machine Learning and Robotics Research


As imitation learning offers intuitive ways to program robotic motions
by demonstrating the desired motion, it has attracted interest from
robotics researchers. The robotics community has developed many
imitation learning methods for motion planning and robot control. When
planning a trajectory for a robotic system, it is often necessary to
make sure that the planned trajectory satisfies constraints such as
smooth convergence to a new goal state. For this reason, robotics
researchers have developed “custom” trajectory representations that
explicitly satisfy the constraints necessary for robotic applications.
Machine learning techniques are often used as a part of such frameworks.
However, robotics researchers need to be aware that a rich set of
algorithms has been developed by the machine learning community, and
some of the new algorithms might eliminate the need for customizing the
policy or trajectory representation.

For machine learning researchers, imitation learning offers interest- 
ing practical and theoretical problems, which differ from standard su- 
pervised and reinforcement learning settings. Although imitation learn- 
ing is closely related to structured prediction, it is often challenging to 
apply existing machine learning methods to imitation learning, espe- 
cially robotic applications. In imitation learning, collecting demonstra- 
tions and performing rollouts are often expensive and time-consuming. 
Therefore, it is necessary to consider how to minimize these costs and 
perform learning efficiently. In addition, embodiments and observabil- 
ity of the learner and the expert are different in many applications. In 
such cases, the demonstrated motion needs to be adapted based on the 
learner’s embodiment and observability. These difficulties in imitation 
learning present new challenges to machine learning researchers. 


1.5 Statistical Machine Learning Background 


To understand imitation learning algorithms, familiarity with several 
concepts in statistical machine learning is essential. In this section, we 
briefly introduce the notation we use and these concepts. 


1.5.1 Notation and Mathematical Formalization 


Before introducing important concepts in machine learning, we introduce
the notation used in this survey. Table 1.1 summarizes our notation.
Throughout this survey, we use the bold style for vector values, and the
non-bold style for scalar values. Demonstrations by an expert are often
given as a set of trajectories. In this case, the dataset of
demonstrations is given by $\mathcal{D} = \{\tau_1, \ldots, \tau_N\}$.
We use subscripts to denote the time index; $x_t$ represents the state
of the system at time step $t$. We review many methods that manipulate
probability distributions in various ways. To make equations concise,
the probability distribution induced by the expert’s policy is denoted
by $q$, and the distribution induced by the learner’s policy is denoted
by $p$. For example, $p(\tau)$ represents the probability distribution
over trajectories induced by the learner’s policy. The term “action” is
mainly used in the machine learning community, and “control input” is
mainly used in the robotics and control theory communities. Since
imitation learning methods have been developed in all of these
communities, we use the terms “action” and “control input”
interchangeably. We use the term “context” to refer to the conditions
relevant to the task. The context $s$ can be the initial state of the
system $x_0$ or the state of relevant objects. For instance, the
position of the ball can be part of the context in a hitting-a-ball
task. We use $T$ to denote the finite horizon of the trajectory.
Therefore, the total number of time steps of a single trajectory is
$T + 1$ in our notation.

Table 1.1: Table of Notation. We use a notation common in the control literature
for states and controls.

Symbol                     Meaning
$x$                        system state
$s$                        context
$\phi$                     feature vector
$u$                        control input / action
$\tau$                     trajectory
$\pi$                      policy
$\mathcal{D}$              dataset of demonstrations
$q$                        probability distribution induced by an expert’s policy
$p$                        probability distribution induced by a learner’s policy
$t$                        time
$T$                        finite horizon
$N$                        number of demonstrations
$\cdot^{E}$                superscript representing an expert,
                           e.g. $\pi^{E}$ denotes an expert’s policy
$\cdot^{L}$                superscript representing a learner,
                           e.g. $\pi^{L}$ denotes a learner’s policy
$\cdot^{\mathrm{demo}}$    superscript representing a demonstration by an expert,
                           e.g. $\tau^{\mathrm{demo}}$ denotes a trajectory
                           demonstrated by an expert


1.5.2 Markov Property 


A sequence of states $x_0, \ldots, x_t$ is a Markov chain if at any time
$t$, the future states $x_{t+1}, x_{t+2}, \ldots$ depend on the history
$x_0, \ldots, x_t$ only through the present state $x_t$ [Serfozo, 2009].
In other words, the next state $x_{t+1}$ only depends on the current
state $x_t$ in a Markov chain. This property is called the Markov
property.


1.5.3 Markov Decision Process 


A Markov decision process (MDP) is a process that satisfies the Markov
property. If the state and action spaces are finite, then it is called a
finite Markov decision process (finite MDP) [Sutton and Barto, 1998]. An
MDP is defined as a tuple $(\mathcal{X}, \mathcal{U}, P, \gamma, D, R)$.
Here, $\mathcal{X}$ is a finite set of states; $\mathcal{U}$ is a set of
control inputs; $P$ is a set of state transition probabilities;
$\gamma \in [0, 1)$ is a discount factor; $D$ is the initial-state
distribution from which the initial state $x_0$ is drawn; and
$R : \mathcal{X} \to \mathbb{R}$ is the reward function.
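
As a minimal sketch of the tuple above, a finite MDP can be stored in a small Python data structure. The class name `FiniteMDP`, its field names, and the two-state example are illustrative assumptions rather than part of this survey; the reward is assumed to depend on the state only, as in the definition above.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class FiniteMDP:
    """Container for the tuple (X, U, P, gamma, D, R) of a finite MDP."""
    states: List[str]                                      # X: finite set of states
    actions: List[str]                                     # U: set of control inputs
    transitions: Dict[Tuple[str, str], Dict[str, float]]   # P[(x, u)][x'] = Pr(x' | x, u)
    gamma: float                                           # discount factor in [0, 1)
    initial: Dict[str, float]                              # D: initial-state distribution
    reward: Dict[str, float]                               # R: state -> real number

    def validate(self) -> None:
        """Check that gamma lies in [0, 1) and every distribution sums to one."""
        assert 0.0 <= self.gamma < 1.0
        assert abs(sum(self.initial.values()) - 1.0) < 1e-9
        for x in self.states:
            for u in self.actions:
                assert abs(sum(self.transitions[(x, u)].values()) - 1.0) < 1e-9

    def step(self, x: str, u: str, rng: random.Random) -> str:
        """Sample x_{t+1} from (x_t, u_t) alone, which is the Markov property."""
        nxt = self.transitions[(x, u)]
        return rng.choices(list(nxt.keys()), weights=list(nxt.values()))[0]


# Hypothetical two-state example: "stay" keeps the state, "move" usually flips it.
mdp = FiniteMDP(
    states=["a", "b"],
    actions=["stay", "move"],
    transitions={
        ("a", "stay"): {"a": 1.0},
        ("a", "move"): {"a": 0.1, "b": 0.9},
        ("b", "stay"): {"b": 1.0},
        ("b", "move"): {"a": 0.9, "b": 0.1},
    },
    gamma=0.95,
    initial={"a": 1.0},
    reward={"a": 0.0, "b": 1.0},
)
mdp.validate()
```

The `step` method never looks at earlier states, so any rollout generated with it is a Markov chain once a policy for choosing `u` is fixed.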


1.5.4 Entropy 


Given the random variable $x$ and its probability distribution $p(x)$,
the entropy

$$H(p) = -\int p(x) \ln p(x) \, dx \tag{1.1}$$

is defined as the average amount of information conveyed by transmitting
$x$ [Bishop, 2006]. Note that the entropy $H(p)$ is a concave function
of $p$.
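
For intuition, the discrete analogue of Equation 1.1 can be evaluated directly. The following is a small sketch using hypothetical four-outcome distributions (the names `uniform`, `peaked`, and `mixture` are illustrative); it replaces the integral with a sum.

```python
import math
from typing import Sequence


def entropy(p: Sequence[float]) -> float:
    """Discrete entropy H(p) = -sum_i p_i ln p_i, with 0 ln 0 taken as 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
# An equal-weight mixture of the two distributions above.
mixture = [0.5 * a + 0.5 * b for a, b in zip(uniform, peaked)]

h_uniform = entropy(uniform)   # equals ln 4, the maximum for four outcomes
h_peaked = entropy(peaked)     # much smaller: the outcome is nearly certain
h_mixture = entropy(mixture)   # at least the average of the two entropies
```

The uniform distribution attains the largest entropy, and the entropy of the mixture is no smaller than the average of the entropies of its components.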
1.5.5 Kullback-Leibler (KL) Divergence


In the field of information geometry, the KL divergence is used to
quantify the difference between two probability distributions [Kullback
and Leibler, 1951], i.e.,

$$D_{\mathrm{KL}}(p(x) \,\|\, q(x)) = \int p(x) \ln \frac{p(x)}{q(x)} \, dx. \tag{1.2}$$

Since the KL divergence quantifies the difference between two
probability distributions, it is useful for cases in which stochastic
policies are going to be learned, or stochastic trajectories result from
a deterministic policy. Please note that the KL divergence is not
symmetric; in general,
$D_{\mathrm{KL}}(p \,\|\, q) \neq D_{\mathrm{KL}}(q \,\|\, p)$. The KL
divergence can be obtained as a Bregman divergence derived from the
negative entropy [Amari, 2016] and is widely used as a measure in
multiple imitation learning approaches.
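
The asymmetry is easy to check numerically. The sketch below uses the discrete analogue of Equation 1.2 on two hypothetical three-outcome distributions; the values of `p` and `q` are arbitrary choices for illustration.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Discrete D_KL(p || q) = sum_i p_i ln(p_i / q_i).

    Assumes q_i > 0 wherever p_i > 0; terms with p_i = 0 contribute zero.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


p = [0.8, 0.1, 0.1]
q = [0.4, 0.4, 0.2]

kl_pq = kl_divergence(p, q)   # D_KL(p || q)
kl_qp = kl_divergence(q, p)   # D_KL(q || p), a different number
```

Both values are positive, they differ from each other, and the divergence of a distribution from itself is zero.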


1.5.6 Information and Moment Projections 


One common approach to learning a policy from a dataset is to consider
“projecting” that dataset onto the space of the policy model.
Information theory emphasizes two kinds of projections: the Information
(I)-projection and the Moment (M)-projection [Bishop, 2006]. Using the
Kullback-Leibler (KL) divergence [Kullback and Leibler, 1951], the
I-projection is

$$p^{*} = \arg\min_{p} D_{\mathrm{KL}}(p \,\|\, q), \tag{1.3}$$

and the M-projection is

$$p^{*} = \arg\min_{p} D_{\mathrm{KL}}(q \,\|\, p). \tag{1.4}$$

As the KL divergence is not symmetric, these two projections result in
different solutions when a given distribution is multi-modal as shown in
Figure 1.2. While the M-projection averages over the several modes, the
I-projection concentrates on a single mode. Performing the I-projection
is often not straightforward, whereas the M-projection can often be
performed relatively easily by maximizing the likelihood with respect
to a given training dataset [Bishop, 2006].
Figure 1.2: Illustration of I- and M-projections. Given a distribution with two
modes as shown in black, M-projection will give a solution that averages over two 
modes as shown in red. In contrast, the I-projection will give a solution that
concentrates on one of the modes.
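
The mode-averaging and mode-seeking behavior can be reproduced numerically. The following sketch projects a hypothetical bimodal mixture onto the family of single Gaussians, with all densities discretized on a grid. The mixture parameters, the grid, and the brute-force search for the I-projection are assumptions chosen for illustration; the M-projection onto a Gaussian is computed by matching the mean and variance.

```python
import math
from typing import List, Tuple


def gaussian(x: float, mean: float, std: float) -> float:
    """Density of a univariate Gaussian."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def normalize(w: List[float]) -> List[float]:
    total = sum(w)
    return [wi / total for wi in w]


def kl(p: List[float], q: List[float]) -> float:
    """Discrete D_KL(p || q); terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Grid from -8 to 8 on which all densities are discretized.
xs = [-8.0 + 0.05 * i for i in range(321)]

# Target q: a bimodal mixture with modes at -3 and +3 (hypothetical values).
q = normalize([0.6 * gaussian(x, -3.0, 0.5) + 0.4 * gaussian(x, 3.0, 0.5) for x in xs])

# M-projection onto Gaussians: argmin_p KL(q || p) is given by moment matching.
m_mean = sum(qi * x for qi, x in zip(q, xs))
m_std = math.sqrt(sum(qi * (x - m_mean) ** 2 for qi, x in zip(q, xs)))

# I-projection onto Gaussians: argmin_p KL(p || q), here by brute-force search.
best: Tuple[float, float, float] = (math.inf, 0.0, 0.0)
for i in range(33):                # candidate means -4.0, -3.75, ..., 4.0
    mean = -4.0 + 0.25 * i
    for j in range(1, 41):         # candidate standard deviations 0.1, ..., 4.0
        std = 0.1 * j
        p = normalize([gaussian(x, mean, std) for x in xs])
        cost = kl(p, q)
        if cost < best[0]:
            best = (cost, mean, std)
_, i_mean, i_std = best
```

In this example the M-projection is a broad Gaussian centered between the two modes, while the I-projection selects the heavier mode at -3 with a standard deviation of 0.5.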


1.5.7 The Maximum Entropy Principle 


Let us consider a probability distribution $p(x)$ that matches the
features of an unknown distribution $q(x)$, i.e., it satisfies

$$\mathbb{E}_{p}[\phi(x)] = \mathbb{E}_{q}[\phi(x)],$$

where $\mathbb{E}_{q}[\phi(x)]$, the expectation of a feature function
$\phi(x)$ under $q$, is available. As there are typically infinitely
many such distributions, we need an additional constraint to obtain a
unique solution [Amari, 2016].

The maximum entropy principle [Jaynes, 1957] suggests choosing the
distribution that maximizes the entropy

$$H(p) = -\int p(x) \ln p(x) \, dx$$

among the distributions that satisfy
$\mathbb{E}_{p}[\phi(x)] = \mathbb{E}_{q}[\phi(x)]$. From this
constrained optimization program, the maximum entropy distribution can
be computed as

$$p(x) \propto \exp\left(w^{\top} \phi(x)\right), \tag{1.5}$$

where $w$ is a vector-valued Lagrange multiplier for the feature
matching constraint. While the maximum entropy principle does not
directly translate into a practical algorithm, it uncovers an
interesting observation: every distribution that is in the log-linear
representation given by Equation 1.5 is the maximum entropy distribution
that matches the specific feature expectations given by the feature
vector $\phi(x)$. This is true for typical distributions from the
exponential family such as the Gaussian distribution, which is the
maximum entropy distribution that matches first and second order
moments. The notion of Maximum Entropy generalizes to Maximum Causal
Entropy, which turns out to be a natural notion of uncertainty for
dynamical systems [Ziebart et al., 2013].
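
As a small worked example of Equation 1.5, consider a finite support with the scalar feature $\phi(x) = x$. The sketch below finds the multiplier $w$ by bisection so that the log-linear distribution matches a given feature expectation, and compares its entropy with that of another distribution satisfying the same constraint. The support, the target value, and the comparison distribution are hypothetical choices made for illustration.

```python
import math
from typing import List

support = [0, 1, 2, 3, 4, 5]   # hypothetical finite support
target_mean = 1.5              # assumed value of E_q[phi(x)] with phi(x) = x


def log_linear(w: float) -> List[float]:
    """p(x) proportional to exp(w * phi(x)) with phi(x) = x."""
    weights = [math.exp(w * x) for x in support]
    z = sum(weights)
    return [wi / z for wi in weights]


def mean(p: List[float]) -> float:
    return sum(pi * x for pi, x in zip(p, support))


def entropy(p: List[float]) -> float:
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


# The mean of the log-linear model increases monotonically in w, so bisection
# finds the multiplier that satisfies the feature matching constraint.
lo, hi = -20.0, 20.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if mean(log_linear(mid)) < target_mean:
        lo = mid
    else:
        hi = mid
w = 0.5 * (lo + hi)
p_maxent = log_linear(w)

# Another distribution with the same mean, for comparison.
p_other = [0.25, 0.0, 0.75, 0.0, 0.0, 0.0]
```

Both `p_maxent` and `p_other` have mean 1.5, but the log-linear distribution has the larger entropy, as the principle predicts.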


1.5.8 Background: Reinforcement Learning 


Reinforcement learning is a class of methods that autonomously learns 
policies through iterations of trials and evaluations. The goal of 
reinforcement learning is to learn a policy m that maps the state of 
the system to the control input so as to maximize the expected reward 
J(m). The reward r; represents the quality of the given state, action 
or trajectory at time t. For example, r; could be large when a robot is 
close to the desired trajectory and small when the robot is far from the 
trajectory, or, rg could be large for stable robot grasps and small for 
unstable ones. With a finite horizon T, the expected return is given by 
the accumulation of the reward at each time step, 


T 
J(t) =E [yon 
t=0 


Alternatively, the discounted accumulated reward is used for the infi- 


r ; (1.6) 


nite horizon scenario, i.e., 


J(r)=E bs art 
t=0 


where the discounted factor y controls the trade-off between shorter 


r! ; (1.7) 


term rewards and longer term rewards. The desired policy m* is given 
by 


= arg max Cae (1.8) 


The value of a state $x$ under a policy $\pi$ can be computed as the
expected reward when starting from $x$ and following $\pi$,

$$V^{\pi}(x) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, x_0 = x, \pi\right]. \qquad (1.9)$$

$V^{\pi}(x)$ is often called the value function [Sutton and Barto, 1998].
Likewise, the value of taking action $u$ in state $x$ under a policy
$\pi$ can be computed as the expected reward when starting from the
action $u$ in a state $x$ and thereafter following policy $\pi$,

$$Q^{\pi}(x, u) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, x_0 = x, u_0 = u, \pi\right]. \qquad (1.10)$$

$Q^{\pi}(x, u)$ is often called the action-value function [Sutton and
Barto, 1998].
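
To make the two definitions concrete, the sketch below evaluates a fixed policy on a hypothetical two-state problem with deterministic dynamics. The states, actions, rewards, and the fixed-point iteration used to approximate the infinite discounted sum are assumptions made for this illustration only.

```python
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]  # 0 = stay, 1 = switch to the other state

# Hypothetical deterministic dynamics: NEXT[x][u] is the successor state.
NEXT = {0: {0: 0, 1: 1}, 1: {0: 1, 1: 0}}
# Hypothetical reward for taking action u in state x.
REWARD = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0, 1: 0.0}}

def evaluate_policy(policy, sweeps=500):
    # Repeatedly apply V(x) <- r(x, pi(x)) + gamma * V(x') until it settles.
    v = {x: 0.0 for x in STATES}
    for _ in range(sweeps):
        v = {x: REWARD[x][policy[x]] + GAMMA * v[NEXT[x][policy[x]]]
             for x in STATES}
    return v

def action_values(policy):
    # Q(x, u): take u once, then follow the policy (valued by V) afterwards.
    v = evaluate_policy(policy)
    return {(x, u): REWARD[x][u] + GAMMA * v[NEXT[x][u]]
            for x in STATES for u in ACTIONS}

policy = {0: 1, 1: 0}  # move to state 1, then stay there
V = evaluate_policy(policy)
Q = action_values(policy)
print(V)
print(Q)
```

By construction, $Q^{\pi}(x, \pi(x)) = V^{\pi}(x)$ for every state, which is a useful sanity check on any implementation.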

For an overview of reinforcement learning methods, please refer to
[Sutton and Barto, 1998, Szepesvari, 2010, Wiering and van Otterlo,
2012, Sugiyama et al., 2013], and for an overview of reinforcement
learning in robotics, please refer to Kober et al. [2013], Deisenroth
et al. [2013b].


1.6 Formulation of the Imitation Learning Problem 


The goal of imitation learning is to learn a policy that reproduces the 
behavior of experts who demonstrate how to perform the desired task. 
Suppose that the behavior of the expert demonstrator (or the learner
itself) can be observed as a trajectory $\tau = [\phi_0, \ldots, \phi_T]$,
which is a sequence of features $\phi$. The features $\phi$, which can be
the state of the robotic system or any other measurements, can be chosen
according to the given problem. Please note that the features $\phi$ do
not have to be manually specified, and $\phi$ could be as general as
simply pixels in raw images.

Often, the demonstrations are recorded under different conditions,
for example, grasping an object at different locations. We will refer to
these task conditions as the context vector $s$ of the task, which is
stored together with the feature trajectories. The context $s$ can
contain any information relevant to the task, e.g., the initial state of
the robotic system or positions of target objects. Note that, as the
context describes the current task, it is typically fixed during task
execution and the only dynamic aspects of the problem are the state
features $\phi_t$. Optionally, a reward signal $r$ that the expert is
trying to optimize is also available in some problem settings [Ross and
Bagnell, 2014].

In imitation learning, we collect a dataset of demonstrations
$\mathcal{D} = \{(\tau_i, s_i, r_i)\}$ that consists of tuples of
trajectories $\tau$, contexts $s$, and optionally reward signals $r$.
The data collection process can be either offline or online. Using the
collected dataset $\mathcal{D}$, a common optimization-based strategy
learns a policy $\pi^*$ that satisfies

$$\pi^* = \arg\min_{\pi} D\left(q(\phi), p(\phi)\right), \qquad (1.11)$$


where $q(\phi)$ is the distribution of the features induced by the
experts’ policy, $p(\phi)$ is the distribution of the features induced
by the learner, and $D(q, p)$ is a similarity measure between $q$ and
$p$. Both offline and
online learning scenarios of this problem have been considered [Ross 
et al., 2011]. Please note that, when the dataset contains demonstra- 
tions of multiple tasks and the contexts include information of each 
task, this problem can be considered multitask learning as in recent 
work by Duan et al. [2017], Finn et al. [2017a,b]. 

In addition, we often have access to an environment such as a sim- 
ulator or a physical robotic system where we can perform and evaluate 
a policy through interaction. This simulator can be used to gather new 
data and iteratively improve the policy to better match the demonstra- 
tions. 
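
The objective in Equation 1.11 can be illustrated with discretized features. The sketch below is one possible instantiation, not the formulation used by any particular method: it assumes a finite set of feature bins, uses the KL divergence as the measure $D$, and adds a small smoothing constant so that the divergence stays finite. All names and the toy data are hypothetical.

```python
import math
from collections import Counter

BINS = ["left", "center", "right"]

def feature_distribution(trajectories, bins, eps=1e-6):
    # Empirical distribution over discretized features, pooled over trajectories.
    counts = Counter(phi for tau in trajectories for phi in tau)
    total = sum(counts.values())
    # eps is a smoothing constant (an assumption of this sketch).
    return {b: (counts[b] + eps) / (total + eps * len(bins)) for b in bins}

def kl_divergence(q, p):
    # D(q, p) = sum_b q(b) * log(q(b) / p(b)); zero when q and p coincide.
    return sum(q[b] * math.log(q[b] / p[b]) for b in q)

# Feature trajectories induced by the expert and by two candidate policies.
expert = [["center", "center", "right"], ["center", "right", "right"]]
candidates = {
    "a": [["center", "right", "right"], ["center", "center", "right"]],
    "b": [["left", "left", "center"], ["left", "center", "center"]],
}

q = feature_distribution(expert, BINS)
scores = {name: kl_divergence(q, feature_distribution(rollouts, BINS))
          for name, rollouts in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

In practice the minimization runs over a parameterized policy class rather than a fixed list of candidates, and the rollouts come from the simulator or robot mentioned above.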


2 Design of Imitation Learning Algorithms


In this chapter, we discuss the design choices of imitation learning 
methods. First, we describe what design choices need to be consid- 
ered, and we then discuss what options we can consider for each design 
decision. Thereafter, we discuss imitation learning methods from an 
information theoretic point of view. 


2.1 Design Choices for Imitation Learning Algorithms 


When developing an imitation learning method, it is necessary to make 
several design choices to formalize the problem. In this section, we 
present a list of some of these design choices. 


• Access to the reward function: imitation learning or
reinforcement learning. A central distinction in imitation
learning is whether or not the learner has access to both an expert
demonstrator and a reward signal that the expert is attempting
to optimize. For instance, in learning to play Atari games [Mnih
et al., 2015] or play Go [Silver et al., 2016] there is an
unambiguous score metric. On the other hand, there exist tasks where
it is feasible for the expert to demonstrate the optimal behavior
but hard to define the reward manually, including learning
to drive a car by demonstration [Pomerleau, 1988] and complex
manipulation such as knot-tying [Osa et al., 2017b].


One might naturally ask what benefit is conferred by an expert if
a reward signal is available: surely we can simply solve the problem
by reinforcement learning? The expert’s role is to rein in
the need for tremendous and expensive global exploration. This
has been consistently demonstrated empirically to speed learning
even on problems with a clear metric (e.g., the ball-in-a-cup
task in [Kober and Peters, 2009]) and recently shown theoretically
to provide a potentially exponential improvement in the
number of samples required to learn [Sun et al., 2017]. The most
common approach to leverage such information is to initialize a
policy by imitation learning from coarse demonstrations and then
refine it by reinforcement learning through trial and error [Silver
et al., 2016, Tesauro, 1995]. Algorithms like SEARN [Daumé III et al.,
2009] and AggreVaTe [Ross and Bagnell, 2014, Sun et al., 2017]
intermix the process of imitation and reinforcement: the learner
attempts multiple actions and the expert provides the best strategy
or an estimate of cost-to-go given the learner’s decision. This
intermixing ensures that the learner is able (with enough samples
and representational power) to recover a policy that is guaranteed
to be nearly as good as the expert (and can be much better),
and prevents small mistakes from cascading into poor overall
behavior.


The emergence of the “V-style jump” [Maryniak et al., 2009]
shown in Figure 2.1 in ski jumping is a textbook example of such
imitation learning by humans. Although it took decades to be
recognized, soon after some athletes achieved successful results
with the V-style jump in the 1990s, it became prominent in the
sport and has been mastered by all the athletes performing ski
jumps. This example illustrates that local optimization around
the initial demonstration can only find local optima, while
imitation learning leads to fast skill acquisition.

Figure 2.1: A ski jumper flies through the air using the highly aerodynamic
“V-style”. “V-style” was adopted by most ski jumpers in the 1990s after some
jumpers demonstrated impressive results with the style (public domain picture
from Wikimedia Commons).


• Parsimonious description of the desired behavior: behavioral
cloning or inverse reinforcement learning. Data-efficient
learning demands that we identify the most compact representation
of a behavior. Often a direct mapping from features to
trajectories/actions is the most parsimonious description of the
policy, and the approach known as behavioral cloning is
used. However, particularly for problems where the behavior is,
crudely speaking, deliberative and focused on long-horizon planning,
the most parsimonious description of the policy may be
to encode the policy as the solution of an optimization or planning
problem [Ratliff et al., 2009, Bagnell, 2015]. Inverse Optimal
Control approaches learn a (surrogate) cost function so that the
behavior that results from solving that optimization is in some
sense similar to that demonstrated by the expert.


• Access to system dynamics: model-based or model-free.
Access to system dynamics is required for making some problems
tractable. For instance, estimation of the system dynamics
is often required for motion planning in under-actuated robots,
in which accurate controllers are not available. Meanwhile, access
to the system dynamics is not necessary when a controller of
sufficient quality is available. Because learning the system dynamics
is not a trivial problem, it is desirable to avoid it where possible.
Thus, it is essential to identify whether access to system dynamics
is necessary for controlling the given system or not.


• Similarity measure between policies. In the event that there
is not a clear notion of the reward function being optimized, a
surrogate notion of similarity between the experts’ policy and the
learner’s policy needs to be established to reproduce the behavior
of the expert. This similarity can be defined at the level of
individual decisions, although it is usually preferred that the notion
of similarity be defined over trajectories the learner and system
take together [Ziebart et al., 2013].


• Features. It is essential to select appropriate features that enable
the desired behavior to be expressed. Features should contain
enough information to solve the problem while limiting the complexity
of learning. The features can be various measurements related
to the desired task, such as the kinematic/dynamic state of the
robotic system and/or the surrounding objects. Learning techniques
based on deeper representations have enabled feature
representations to be at least partially extracted automatically,
e.g., using deep learning [Ratliff et al., 2006a, Bradley, 2010,
Grubb and Bagnell, 2010, Levine et al., 2016, Ho and Ermon,
2016, Finn et al., 2016b].


• Policy representation. The policy representation needs to be
chosen such that the desired behavior can be properly captured. For
example, a policy can be represented by a neural network or a linear
function. With respect to the task abstraction level, we need
to decide at which level of the task we learn, such as task level,
trajectory level, and action-state level. While it is necessary to
select a sufficiently informative representation to model the desired
behavior, increasing the complexity of the policy representation
usually increases the amount of training data and learning time
required.

As one can see above, these design choices are not independent and
the order of these design choices is flexible. For example, the choice of
similarity measures between policies is related to the choice of policy
representations. In the following sections, we present possible options
for some of these design choices.


2.2 Behavioral Cloning and Inverse Reinforcement 
Learning 


One way to obtain a policy that reproduces the demonstrated behavior
is to learn a policy that directly maps from the input to the
action/trajectory. In problems where a dataset of demonstrated
trajectories with state-action pairs and contexts
$\mathcal{D} = \{(x_t, s_t, u_t)\}$ is given, we can directly compute a
mapping from states and/or contexts to control inputs as

$$u_t = \pi(x_t, s_t). \qquad (2.1)$$


This kind of policy can usually be obtained through a standard
supervised learning method. Learning a policy that directly maps from
the state and/or the context to the control input is often referred to
as Behavioral Cloning (BC) [Bain and Sammut, 1996].
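
A minimal sketch of behavioral cloning as supervised regression follows. It assumes a hypothetical scalar system, a hypothetical expert that acts as a proportional controller, and a one-parameter linear policy class fitted by least squares; none of these choices come from the cited work.

```python
def expert_policy(x, s):
    # Hypothetical expert: proportional controller driving state x toward context s.
    return 2.0 * (s - x)

# Demonstration dataset D = {(x_t, s_t, u_t)} gathered from the expert.
demos = [(x, s, expert_policy(x, s))
         for x in (-1.0, 0.0, 0.5, 2.0)
         for s in (0.0, 1.0)]

def fit_linear_policy(data):
    # Least squares for u = k * (s - x): k = sum(e * u) / sum(e * e), e = s - x.
    num = sum((s - x) * u for x, s, u in data)
    den = sum((s - x) ** 2 for x, s, _ in data)
    return num / den

k = fit_linear_policy(demos)

def cloned_policy(x, s):
    # Learned mapping from state and context to control input, as in Eq. (2.1).
    return k * (s - x)

print(k, cloned_policy(0.25, 1.0))
```

Richer policy classes such as neural networks replace the closed-form fit with an iterative optimizer, but the structure (regress $u_t$ on $(x_t, s_t)$) stays the same.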

Alternatively, given a reward signal, a policy can be obtained so as 
to maximize the expected return. Such a policy can be expressed as 


$$\pi^* = \arg\max_{\pi} J(\pi), \qquad (2.2)$$


where $J(\pi)$ is the expectation of the accumulated reward given the
policy $\pi$ as in (1.7). However, the reward function is considered unknown
and needs to be recovered from expert demonstrations under the as- 
sumption that the demonstrations are (approximately) optimal w.r.t. 
this reward function. Recovering the reward function from demonstra- 
tions is often referred to as Inverse Reinforcement Learning (IRL) [Rus- 
sell, 1998] or Inverse Optimal Control (IOC) [Moylan and Anderson, 
1973]. 
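
The idea of recovering a reward that makes the demonstration optimal can be sketched on a deliberately small problem. The example below is a toy of our own construction, not one of the IRL algorithms cited in this survey: it assumes a one-step decision problem, a reward that is linear in hypothetical action features, and a perceptron-style update. The recovered weights are only one of many reward functions consistent with the demonstration.

```python
# Hypothetical action features; the expert is observed choosing "a1".
FEATURES = {
    "a0": (1.0, 0.0),
    "a1": (0.0, 1.0),
    "a2": (0.6, 0.6),
}
EXPERT_ACTION = "a1"

def reward(w, action):
    # Linear reward r(a) = w . phi(a).
    return sum(wi * fi for wi, fi in zip(w, FEATURES[action]))

def best_action(w):
    # Forward problem: the action that is optimal under the current reward.
    return max(sorted(FEATURES), key=lambda a: reward(w, a))

def recover_reward(max_iters=100):
    w = [0.0, 0.0]
    for _ in range(max_iters):
        a = best_action(w)
        if a == EXPERT_ACTION:
            break  # the demonstration is optimal under w
        # Move w toward the expert's features and away from the current optimum.
        w = [wi + fe - fa
             for wi, fe, fa in zip(w, FEATURES[EXPERT_ACTION], FEATURES[a])]
    return w

w = recover_reward()
print(w, best_action(w))
```

In sequential problems the inner `best_action` call becomes a planning or reinforcement learning step, which is why IRL methods are typically far more expensive than this one-step sketch suggests.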

BC and IRL form two major classes of imitation learning methods.
To choose between BC and IRL, it is essential to consider what the
most parsimonious description of the desired behavior is. The policy
learned by an IRL method is valid as long as the estimated reward
function represents the desired behavior appropriately, while a policy
learned by a BC method is valid as long as the learned mapping from
states to actions is valid. The choice between BC and IRL amounts to
selecting the best way to describe the desired behavior, which depends
entirely on the given problem setting. It is essential to analyze how
the desired behavior should be performed when applying imitation
learning methods.


2.3 Model-Free and Model-Based Imitation Learning 
Methods 


Whether or not we access the system dynamics for imitation learning
is one of the crucial design decisions. Although learning and leveraging
the system dynamics often enables data-efficient learning with a system
that has nonlinear and unknown dynamics, learning the system dynamics
can often be challenging. In the reinforcement learning literature,
methods that learn a forward model of the system and leverage it for 
learning a policy are often referred to as model-based, while methods 
that do not explicitly learn a forward model of the system are referred 
to as model-free [Kober et al., 2013, Deisenroth et al., 2013b]. In this 
survey, we apply the same categorization to imitation learning meth- 
ods. Table 2.1 shows a summary of the advantages and disadvantages 
of model-free and model-based methods in imitation learning. 

Model-free imitation learning methods attempt to learn a policy
that reproduces the behavior demonstrated by experts without
learning/using a forward model of the system. Therefore, there is no
need to estimate the system dynamics in model-free imitation learning
methods. Yet, the system dynamics is encoded only implicitly in policies
learned by model-free methods. In many robotic systems, especially in
industrial applications, position/velocity controllers are often
available for controlling joints. In such cases, we can assume that the
robot is fully actuated, and the dynamics of the system is almost
negligible in motion planning if a reasonably smooth trajectory is used.
Model-free imitation learning methods can be easily applied to motion
planning for such (nearly) fully-actuated robotic systems when the
demonstrations by experts are available. For this reason, behavioral
cloning methods which learn a direct mapping from states/contexts to
actions have focused on model-free methods until recent years.

For motion planning of underactuated systems, it is often neces- 
sary to plan a feasible trajectory by considering the system dynamics. 
It can be challenging to use model-free BC methods to learn trajec- 
tories in such underactuated systems where the reachable states are 
limited. However, recent IRL work by Boularias et al. [2011], Finn 
et al. [2016b], Ho and Ermon [2016] shows how one can learn skills 
in underactuated systems through iterative rollouts without explicitly 
learning a dynamics model. 

Model-based imitation learning methods attempt to learn a policy 
that reproduces the demonstrated behavior by learning/using the sys- 
tem dynamics, e.g. a forward model of the system. This property can 
be critical especially for underactuated robots. Since underactuation 
limits the number of reachable states, it is essential to take into ac- 
count the dynamics of the system when planning feasible trajectories. 
Moreover, the prior knowledge of the system dynamics makes inverse 
reinforcement learning easier since the learner’s performance can be 
easily predicted when the system dynamics is known. However, in a
real robotic system, it is often challenging to learn the system
dynamics. For example, it is hard to model the contact between
deformable objects, and it will be difficult to apply model-based
methods to tasks that involve such contacts.

Table 2.1: Advantages and disadvantages of model-based and model-free methods
in imitation learning. Model-free methods learn a policy without knowledge of the
system dynamics, and the system dynamics is encoded only implicitly in policies.
Model-based methods learn a policy that explicitly satisfies the system dynamics by
leveraging the system dynamics. However, learning/estimating the system dynamics
can be challenging.

              | Model-free                    | Model-based
--------------+-------------------------------+----------------------------
Advantages    | A policy can be learned       | The learning process can
              | without learning/estimating   | be data-efficient.
              | the system dynamics.          | A learned policy satisfies
              |                               | the system dynamics.
--------------+-------------------------------+----------------------------
Disadvantages | The prediction of future      | Model learning can be
              | states is difficult.          | difficult.
              | The system dynamics is only   | Computationally expensive.
              | implicitly considered in the  |
              | resulting policy.             |
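
Learning a forward model, the step that separates model-based from model-free methods, can be illustrated on a hypothetical scalar linear system $x_{t+1} = a x_t + b u_t$. The sketch below assumes noise-free data and fits $a$ and $b$ by least squares through the 2x2 normal equations; real systems, in particular those involving contact, are rarely this benign.

```python
def simulate(a, b, x0, controls):
    # Roll out the (hypothetical) true system x_{t+1} = a * x_t + b * u_t.
    xs = [x0]
    for u in controls:
        xs.append(a * xs[-1] + b * u)
    return xs

def fit_forward_model(xs, controls):
    # Least squares for (a, b) via the normal equations
    # [sxx sxu; sxu suu] [a; b] = [sxy; suy].
    x_now, x_next = xs[:-1], xs[1:]
    sxx = sum(x * x for x in x_now)
    sxu = sum(x * u for x, u in zip(x_now, controls))
    suu = sum(u * u for u in controls)
    sxy = sum(x * y for x, y in zip(x_now, x_next))
    suy = sum(u * y for u, y in zip(controls, x_next))
    det = sxx * suu - sxu * sxu
    a_hat = (sxy * suu - suy * sxu) / det
    b_hat = (suy * sxx - sxy * sxu) / det
    return a_hat, b_hat

controls = [1.0, -1.0, 0.5, 0.0, 2.0]
xs = simulate(0.9, 0.5, 1.0, controls)
a_hat, b_hat = fit_forward_model(xs, controls)

def predict(x, u):
    # The learned forward model can now predict the outcome of candidate actions.
    return a_hat * x + b_hat * u

print(a_hat, b_hat, predict(1.0, 0.0))
```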

Existing imitation learning methods can be categorized into behavioral
cloning and inverse reinforcement learning with a distinction
between model-free and model-based methods as shown in Table 2.2.
At a glance, one can see that studies on behavioral cloning have focused
on model-free methods and studies on inverse reinforcement learning
have focused on model-based methods, although recent studies on IRL
have proposed model-free methods. BC methods have mainly focused
on trajectory planning for robotic systems in which a lower-level
controller is available. A model-free approach is a reasonable choice in
such applications because the dynamics of the system is not crucial.
On the other hand, IRL has focused on learning a policy in action-state
space, which needs to be iteratively evaluated in a given system.
A model-based approach is suitable for such applications, and this is
why many model-based methods have been developed for IRL.

Table 2.2: Categorization of existing imitation learning methods with distinction
between model-free and model-based methods. Model-free methods are dominant in
behavioral cloning, and model-based methods are dominant in inverse reinforcement
learning. Recent studies on IRL have proposed model-free methods.

              | Model-free                     | Model-based
--------------+--------------------------------+------------------------------
Behavioral    | Widrow and Smith [1964],       | Ude et al. [2004],
Cloning       | Chambers and Michie [1969],    | Englert et al. [2013],
              | Pomerleau [1988],              | van den Berg et al. [2010]
              | Schaal et al. [2004],          |
              | Schaal [1999],                 |
              | Ijspeert et al. [2013],        |
              | Calinon et al. [2007],         |
              | Khansari-Zadeh and Billard     |
              | [2011], Paraschos et al.       |
              | [2013], Osa et al. [2014],     |
              | Ross and Bagnell [2010],       |
              | Ross et al. [2011], Takano and |
              | Nakamura [2015], Maeda et al.  |
              | [2016], Denisa et al. [2016],  |
              | Ho and Ermon [2016]            |
--------------+--------------------------------+------------------------------
Inverse       | Boularias et al. [2011],       | Abbeel and Ng [2004],
Reinforcement | Kalakrishnan et al. [2013]     | Ratliff et al. [2006b],
Learning      |                                | Silver et al. [2010],
              |                                | Ziebart et al. [2008],
              |                                | Ziebart [2010],
              |                                | Levine et al. [2011],
              |                                | Levine and Koltun [2012],
              |                                | Hadfield-Menell et al. [2016],
              |                                | Finn et al. [2016b]

[Diagram: Expert → Demonstration $\mathcal{D} = \{(\tau_i, s_i)\}$ → Learner]

Figure 2.2: Diagram of general imitation learning. The learner cannot directly
observe the expert’s policy in many problems. Instead, a set of trajectories induced
by the expert’s policy is available in imitation learning. The learner estimates the
policy that reproduces the expert’s behavior using the given demonstrations. Please
note that the process of querying the demonstration and updating the learner’s
policy can be interactive.


2.4 Observability 


The main goal of many imitation learning methods is to learn a policy
that reproduces the expert’s behavior. Since the expert’s policy
cannot be directly observed, the learner recovers the policy from the
expert’s demonstrations. The diagram in Figure 2.2 illustrates the
imitation learning process. To formulate an imitation learning problem,
it is necessary to consider what can be observed in practice.

For a formal definition, it is necessary to determine the observability
of the state. Observability can vary significantly between different
applications, leading to different kinds of learning methods.




2.4.1 Trajectories in Fully Observable Settings 


When the state of the system is fully observable, we can obtain a tra- 
jectory as a sequence of the state and the control input as 


$$\tau = [x_0, u_0, x_1, u_1, \ldots, x_T, u_T]. \qquad (2.3)$$


For instance, both the state and the control inputs are observable in a 
teleoperated system in [Abbeel et al., 2010, van den Berg et al., 2010, 
Osa et al., 2014, Ross et al., 2011], although observation can be noisy. 


2.4.2 Trajectories in Partially Observable Settings 


In some settings of imitation learning, the control input by the experts 
is not observable in demonstrations, and only the states of the system 
during the demonstrations are given. In such cases, the trajectory is 
given as a sequence of the state of the system, 


$$\tau = [x_0, x_1, \ldots, x_T]. \qquad (2.4)$$


For example, the control inputs to achieve the demonstrated trajectory
are often unobservable in kinesthetic teaching [Kober and Peters, 2009,
Englert et al., 2013, Maeda et al., 2016]. Also, when transferring
motions captured from a human expert to a humanoid robot, the control
inputs to achieve the desired motion in the learner’s embodiment
cannot be observed [Ijspeert et al., 2002b, Grimes et al., 2006b, Grimes
and Rao, 2009]. In addition, the state of the system is often only
partially observable. In this case, the trajectory is given as a
sequence of partial observations of the system,


$$\tau = [y_0, y_1, \ldots, y_T], \qquad (2.5)$$


where $y_t$ is the partial observation of the system, which is often
given by $y_t = f(x_t)$, where $f$ is the observation function. As a
special case, the observation $y_t$ can be linear w.r.t. the state $x_t$
as $y_t = H x_t$, where $H$ is the observation matrix.
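
The three kinds of demonstration data in (2.3), (2.4) and (2.5) differ only in what is recorded. The sketch below shows them side by side for a hypothetical system whose state is a position and a velocity, with a linear observation matrix that measures position only; the numbers are made up for illustration.

```python
def observe(H, x):
    # Linear observation y_t = H x_t, with H given as a list of rows.
    return [sum(h * xi for h, xi in zip(row, x)) for row in H]

# Hypothetical demonstration: states are [position, velocity].
states = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.5]]
controls = [0.0, -0.5, -0.5]

# Fully observable setting, as in (2.3): states paired with control inputs.
tau_full = list(zip(states, controls))

# Control inputs unobserved, as in (2.4): states only.
tau_states = list(states)

# Partially observable setting, as in (2.5): only position is measured.
H = [[1.0, 0.0]]
tau_obs = [observe(H, x) for x in states]

print(tau_full)
print(tau_states)
print(tau_obs)
```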


2.4.3 Differences in Observability between the Expert and the
Learner


In imitation learning, the expert and the learner often observe the 
environment differently. For example, in robotic manipulation tasks a 
human expert often obtains much richer sensory information compared 
to a robot learner due to the differences in their sensory embodiments. 
As another example, a robotic learner may be able to record sensory 
information more accurately and at a higher rate than a human ex- 
pert. In such cases, the information of the learner about the environ- 
ment/system differs from the information of the expert and should be 
taken into account when formalizing the imitation learning problem. 
In general, the observability of the expert and learner can manifest in 
different ways: 


• The expert observes partially
    - the system state
    - the control inputs by the expert
    - learner’s observations

• The learner observes partially
    - the system state
    - expert’s observations
    - the control inputs by the expert
    - the control inputs by the learner


These cases need to be taken into account when deciding on the
imitation learning approach for a specific application. When the expert
observes the system state partially, the expert demonstrations can
become sub-optimal, requiring careful consideration. Moreover, when the
expert only partially observes the learner’s observations, the learner
may have more information about its own embodiment than the expert
does. For example, if a human expert uses kinesthetic teaching to show
how to grasp an object, the demonstration may be sub-optimal for a robot
learner if the expert does not see what the robot observes.

In imitation learning, the expert is often assumed to behave optimally.
However, this optimality is often based on partial observations
which may differ significantly from the observations of the learner. For
example, if the human expert performs a motion which goes around
an obstacle which the robot learner does not observe, the robot learner
learns to perform a similar circumnavigation motion even when there
are no obstacles. Moreover, when the learner only partially observes
the expert’s observations, the learner can make wrong predictions about
the policy behind the expert’s behavior.


2.5 Policy Representation in Imitation Learning 


One of the important design choices in imitation learning is policy 
representation. In this section, we discuss the design choices related to 
policy representation. 


2.5.1 Levels of Policy Abstraction 


For imitation learning, several types of policy abstractions can be
used. We can categorize the policy representations into three types:
1) symbolic-level (task-level) abstraction, 2) trajectory-level
abstraction, and 3) action-state space abstraction. In task-level
planning, the learner learns a policy that generates an option
$o \in \mathcal{O}$, where $\mathcal{O}$ is the set of options. Options
are often defined as policies of taking actions over a period of time
[Sutton et al., 1999]. In this task-level planning, each option often
consists of a set of actions or trajectories. A policy maps given states
$x_t$ and contexts $s$ to sequences of options in the task-level
abstraction,

$$\pi : x_t, s \mapsto [o_1, \ldots, o_T], \qquad (2.6)$$


where $T$ is the horizon of the task. A complex task is often hard to
model as a single movement. The task-level abstraction enables modeling
such a complex task as a sequence of simple movements. BC methods
such as [Konidaris et al., 2011, Niekum et al., 2014, Kroemer et al.,
2015] model a complex task as a sequence of movement primitives.

In trajectory planning, a policy maps a context $s$ to a trajectory
$\tau$ that is a sequence of the state of the system $x$ (and control
inputs $u$) as

$$\pi : s \mapsto \tau. \qquad (2.7)$$


BC methods such as DMP [Schaal et al., 2004, Ijspeert et al., 2013] 
and ProMP [Paraschos et al., 2013, Maeda et al., 2016] learn such 
trajectory-based policies. 

In the action-state space level, a policy maps states of the system
$x_t$ and contexts $s$ to control inputs $u_t$ as


$$\pi : x_t, s \mapsto u_t. \qquad (2.8)$$


BC methods such as [Chambers and Michie, 1969, Pomerleau, 1988, 
Khansari-Zadeh and Billard, 2011, Ross et al., 2011] and IRL methods 
such as [Abbeel and Ng, 2004, Ziebart et al., 2008, Boularias et al., 
2011, Finn et al., 2016b] learn policies in action-state space. These 
abstractions are summarized in Table 2.3. 
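
The three abstraction levels differ in the signature of the policy. The sketch below writes the mappings (2.6), (2.7) and (2.8) as Python type aliases and gives one hand-written placeholder per level; the option names, the straight-line trajectory, and the proportional gain are hypothetical and are not learned from data.

```python
from typing import Callable, List, Sequence

State = Sequence[float]
Context = Sequence[float]
Control = List[float]
Trajectory = List[List[float]]
Option = str

TaskPolicy = Callable[[State, Context], List[Option]]   # Eq. (2.6)
TrajectoryPolicy = Callable[[Context], Trajectory]      # Eq. (2.7)
ActionPolicy = Callable[[State, Context], Control]      # Eq. (2.8)

def task_policy(x: State, s: Context) -> List[Option]:
    # Task level: output a sequence of discrete options.
    return ["reach", "grasp", "lift"]

def trajectory_policy(s: Context, steps: int = 4) -> Trajectory:
    # Trajectory level: a straight line from the origin to the goal in s.
    return [[g * k / steps for g in s] for k in range(steps + 1)]

def action_policy(x: State, s: Context, gain: float = 0.5) -> Control:
    # Action-state level: a control input computed from the current state.
    return [gain * (g - xi) for xi, g in zip(x, s)]

print(task_policy([0.0, 0.0], [1.0, 2.0]))
print(trajectory_policy([1.0, 2.0]))
print(action_policy([0.0, 0.0], [1.0, 2.0]))
```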

Existing imitation learning methods can be categorized based on
task abstractions as shown in Table 2.4. The table displays an
abundance of model-free methods for trajectory learning. On the
contrary, many model-based IRL methods have been developed with
action-state space abstractions. Since commercially available robotic
manipulators often have a position/velocity controller, model-free
methods are preferred for trajectory planning in such systems. This is
especially pronounced in motion planning methods designed for robotic
manipulators in the robotics research community. On the other hand, the
machine learning community has developed many IRL methods for learning
a policy in action-state space.

Table 2.3: Abstraction and the related policy in imitation learning. In a
task-level abstraction, the policy maps from the initial state $x_0$ to a sequence
of discrete options, where an option at time step $t$ is denoted with $o_t$. In a
trajectory-level abstraction, the policy maps from an initial state $x_0$ to a
trajectory $\tau$. In an action-state space abstraction, the policy maps from the
current state $x_t$ to a control $u_t$.

Abstraction level               | Policy
--------------------------------+------------------------------------------
Task-level abstraction          | $\pi : x_t, s \mapsto [o_1, \ldots, o_T]$
Trajectory-based abstraction    | $\pi : s \mapsto \tau$
Action-state space abstraction  | $\pi : x_t, s \mapsto u_t$


2.5.2 Hierarchical vs Monolithic Policies 


When we consider a single abstraction level of policy, the policy will
be non-hierarchical/monolithic. BC methods such as [Chambers and
Michie, 1969, Pomerleau, 1988, Schaal et al., 2004, Khansari-Zadeh
and Billard, 2011, Paraschos et al., 2013, Ross et al., 2011] and IRL
methods such as [Abbeel and Ng, 2004, Ratliff et al., 2006b, Ziebart,
2010, Finn et al., 2016b] are monolithic. Thus far, numerous methods
have been developed for learning a monolithic policy. However, we need
to employ a complex policy representation such as the neural network
policy in [Finn et al., 2016b] in order to learn a complex task with a
monolithic policy.

Table 2.4: Categorization of imitation learning methods based on different policy
abstractions with distinction between model-free and model-based methods. Many
model-free methods have been developed for imitation learning with
trajectory-based abstractions. On the contrary, many model-based IRL methods have
been developed with action-state space abstractions.

              | Model-free                     | Model-based
--------------+--------------------------------+------------------------------
Task-level    | Takano and Nakamura [2015],    |
abstraction   | Niekum et al. [2014],          |
              | Konidaris et al. [2014],       |
              | Inamura et al. [2004]          |
--------------+--------------------------------+------------------------------
Trajectory-   | Schaal et al. [2004],          | Ude et al. [2004],
based         | Schaal [1999],                 | Englert et al. [2013],
abstraction   | Ijspeert et al. [2013],        | van den Berg et al. [2010],
              | Calinon et al. [2007],         | Abbeel et al. [2010]
              | Khansari-Zadeh and Billard     |
              | [2011], Paraschos et al.       |
              | [2013], Osa et al. [2014],     |
              | Maeda et al. [2016],           |
              | Denisa et al. [2016]           |
--------------+--------------------------------+------------------------------
Action-state  | Chambers and Michie [1969],    | Abbeel and Ng [2004],
space         | Widrow and Smith [1964],       | Ratliff et al. [2006b],
abstraction   | Pomerleau [1988],              | Silver et al. [2010],
              | Ross and Bagnell [2010],       | Ziebart et al. [2008],
              | Ross et al. [2011],            | Ziebart [2010],
              | Boularias et al. [2011],       | Levine et al. [2011],
              | Kalakrishnan et al. [2013],    | Levine and Koltun [2012],
              | Ho and Ermon [2016]            | Hadfield-Menell et al. [2016],
              |                                | Finn et al. [2016b]

On the contrary, by combining the different levels of abstraction, 
we can learn a hierarchical policy where the lower-level policies learn 
to perform the primitive behavior and the upper-level policy learns 
to plan a sequence of the lower-level policies. BC methods such as 
[Niekum et al., 2014, Konidaris et al., 2014, Kroemer et al., 2015] and 
IRL methods such as [Kolter et al., 2008, Choi and Kim, 2015, Krishnan 
et al., 2016] learn hierarchical policies. Since a hierarchical policy can
be decomposed into a sequence of the lower-level policies, we do not
have to use a complex policy representation for the lower-level policies.
On the other hand, it is not trivial to learn all of the lower-level and
upper-level policies simultaneously.
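
The decomposition can be sketched with two hand-written lower-level policies and an upper-level policy that chooses between them. Everything here is hypothetical and hand-coded rather than learned: a scalar state, the options `reach` and `hold`, and a reactive selection rule. The point is only that each lower-level policy can stay simple.

```python
def reach(x, goal):
    # Lower-level policy: step toward the goal, limited to 0.5 per step.
    return max(-0.5, min(0.5, goal - x))

def hold(x, goal):
    # Lower-level policy: stay in place.
    return 0.0

LOWER_LEVEL = {"reach": reach, "hold": hold}

def upper_policy(x, goal, tol=1e-9):
    # Upper-level policy: select which lower-level policy to run next.
    return "hold" if abs(goal - x) <= tol else "reach"

def run(x, goal, horizon=5):
    options = []
    for _ in range(horizon):
        option = upper_policy(x, goal)
        x = x + LOWER_LEVEL[option](x, goal)
        options.append(option)
    return x, options

final_x, options = run(0.0, 1.25)
print(final_x, options)
```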


2.5.3 Feedback vs Open-Loop/Feedback-Free Policies 


With regard to feedback of the state, policies can be categorized into 
two types: feedback and open-loop/feedback-free policies. A feedback 
policy iteratively determines the control input/desired behavior based 
on the feedback from the environment. In other words, a feedback policy 
considers the changes of the environment caused by the previous control 
input in sequential decision making. A policy for determining the torque 
input to a robotic manipulator is often learned in robotic applications 
such as [Boularias et al., 2011, Englert et al., 2013]. Such a torque- 
based control is often learned as a feedback policy since it is essential 
to consider the state of the system in sequential decision making where 
a small mistake can cause a big error in the next state. 

In contrast, an open-loop/feedback-free policy determines the control
input/desired behavior based only on the initial input. Therefore,
once an open-loop policy starts running, it does not use feedback
from the environment. A policy for planning a desired trajectory can
often be learned as an open-loop policy since it can be treated as
one-shot decision making for a given situation, as in [Calinon
et al., 2007, Takano and Nakamura, 2015]. However, there are methods
for planning and updating the desired trajectory during the task
execution such as [Ijspeert et al., 2013, Paraschos et al., 2013,
Schulman et al., 2013, Osa et al., 2017b]. For example, in the framework
of [Schulman et al., 2013], the trajectory is learned as a direct
function of the pixel values observed, and the desired trajectory is
updated online.

[Diagram: nested policy classes labeled stationary deterministic,
non-stationary deterministic, stationary stochastic, and non-stationary
stochastic]

Figure 2.3: Illustration of the relationships between basic policy classes.
Stationarity is a special case of non-stationarity and determinism is a special
case of stochasticity. We use the terms “stationary” and “time-invariant”
interchangeably. Likewise, “non-stationary” and “time-variant” are used
interchangeably. Please see § 2.5.4 for more details.

Different policy types can be used in the same system at different
levels. In the acrobatic helicopter flights by Abbeel et al. [2010],
the scheme for planning the desired trajectory can be interpreted as
an open-loop policy because the system does not update the desired
trajectory during the flight. Meanwhile, the iterative LQR controller
used for the lower-level control in [Abbeel et al., 2010] can be
considered a feedback policy because it determines the control input
based on the observation of the system.
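
The distinction between the two policy types can be illustrated with a minimal Python sketch. The one-dimensional system, the single unplanned push, the gain of 0.5, and all function names below are our own assumptions for illustration and are not taken from the cited works. The open-loop policy replays a plan computed from the initial state, whereas the feedback policy recomputes the control input from the observed state.

```python
GOAL = 1.0
STEPS = 20

def rollout(policy, push_at=5, push=0.5):
    """Simulate x_{t+1} = x_t + u_t with one unplanned push; return the final state."""
    x = 0.0
    for t in range(STEPS):
        u = policy(t, x)
        x = x + u
        if t == push_at:
            x += push  # disturbance that was unknown at planning time
    return x

def open_loop_policy(t, x):
    # Planned once from the initial state x_0 = 0; ignores the observed state x.
    return GOAL / STEPS

def feedback_policy(t, x):
    # Proportional feedback on the observed state (the gain 0.5 is an assumption).
    return 0.5 * (GOAL - x)

open_loop_error = abs(rollout(open_loop_policy) - GOAL)
feedback_error = abs(rollout(feedback_policy) - GOAL)
print(open_loop_error, feedback_error)
```

In this toy setting the open-loop rollout keeps the full offset caused by the push, while the feedback rollout removes most of it within the remaining steps.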


2.5.4 Stationarity and Stochasticity of Policies 


With respect to stationarity, we can categorize policies into stationary 
and non-stationary policies, depending on whether the policy depends 
on time. Moreover, we can categorize policies into deterministic and 
stochastic policies in terms of stochasticity. Note that stationarity is 
a special case of non-stationarity and determinism is a special case of 
stochasticity. Figure 2.3 illustrates relationships between these policy 
classes. 


2.5.4.1 Stationary vs. Non-Stationary Policies


A non-stationary (time-variant) policy depends on time. Typically,
trajectory-based policies are non-stationary since the policy depends on
the time step or phase of the trajectory. For example, a complex
movement of a robot arm through space [van den Berg et al., 2010, Osa
et al., 2014] needs to be performed such that the learned speed is
similar to the demonstrated speed over the whole trajectory, which often
requires a non-stationary policy. A stationary (time-invariant) policy
depends only on the current state of the system. Stationary policies are
typically used in applications where data from different time steps can
be similar. For example, in a racing car simulation [Abbeel and Ng,
2004, Ross et al., 2011], steering right when about to drive left off the
road is a useful action independent of the time at which this occurs. In
another instance, a simple motion for approaching an object can also be
learned as a stationary policy [Khansari-Zadeh and Billard, 2011].


2.5.4.2 Deterministic Policy 


A deterministic policy for trajectory planning determines a unique
trajectory $\tau$ for a given initial state $x_0$ and/or context $s$ as

\[ \tau = \pi(x_0, s). \tag{2.9} \]

Behavior cloning methods such as dynamic movement primitives
[Ijspeert et al., 2013, Schaal et al., 2004] can be interpreted as
deterministic policy representations for trajectory planning.

A deterministic policy in action-state space determines a unique
control input $u$ for a given state $x$ and/or context $s$ as

\[ u = \pi(x, s). \tag{2.10} \]

In this case, $\pi$ represents a deterministic function of $x$. When a
deterministic policy is used and both states and actions are fully
observable, the distribution of the trajectory
$\tau = [x_0, u_0, \ldots, x_T, u_T]$ is given as

\[ p(\tau) = p(x_0) \prod_{t=1}^{T} p(x_{t+1} \mid x_t, \pi(x_t)). \tag{2.11} \]

Commonly, in non-adversarial sequential decision making problems,
such as Markov decision processes, the optimal policy for accom- 
plishing the objective for a given model is deterministic. Inverse 
reinforcement learning methods such as MMP [Ratliff et al., 2006b] 
and LEARCH [Ratliff et al., 2009, Zucker et al., 2011] employ a de- 
terministic policy derived from a reward/cost function recovered from 
demonstrations. Behavior cloning methods such as [Pomerleau, 1988, 
Khansari-Zadeh and Billard, 2011] also employ deterministic policies. 


2.5.4.3 Stochastic Policy 


A stochastic policy in action-state space draws a control input $u$
according to a probability distribution for a given state $x$ and/or
context $s$ as

\[ u \sim \pi(u \mid x, s). \tag{2.12} \]

In this case, $\pi$ represents a conditional distribution of the control
input $u$ given $x$ and $s$. If the probability distribution is given as a
delta function, the policy is deterministic. When a stochastic policy is
used and both states and actions are fully observable, the distribution of
the trajectory $\tau = [x_0, u_0, \ldots, x_T, u_T]$ is given as

\[ p(\tau) = p(x_0) \prod_{t=1}^{T} p(x_{t+1} \mid x_t, u_t)\, \pi(u_t \mid x_t). \tag{2.13} \]


A stochastic policy is useful to model the stochastic behavior of the 
expert. Inverse reinforcement learning methods such as [Ziebart et al., 
2008, Boularias et al., 2011] employ a stochastic policy to represent such 
stochastic behavior. Stochastic policies introduce uncertainty, which 
is useful for exploring the parameter space of the policy in iterative 
methods. Model-based behavior cloning methods such as [Englert et al., 
2013] and inverse reinforcement learning methods such as [Finn et al., 
2016b] employ a stochastic policy and learn system dynamics through 
iterative learning. 
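
As a rough illustration of how a stochastic policy and the system dynamics jointly define a trajectory distribution in the spirit of (2.13), the following Python sketch evaluates the probability of short trajectories in a two-state, two-action Markov model. All probability tables are made-up values, and for simplicity the trajectory here ends with a state rather than an action. A deterministic policy is recovered as the special case in which each row of the policy table is a delta distribution.

```python
from itertools import product

P0 = [0.6, 0.4]                           # p(x_0), assumed initial-state distribution
# DYN[x][u][x_next] = p(x_next | x, u), assumed dynamics
DYN = [[[0.9, 0.1], [0.2, 0.8]],
       [[0.7, 0.3], [0.1, 0.9]]]
STOCHASTIC = [[0.5, 0.5], [0.2, 0.8]]     # pi(u | x); each row sums to 1
DETERMINISTIC = [[1.0, 0.0], [0.0, 1.0]]  # delta distribution: always u = x

def trajectory_prob(states, actions, policy):
    """p(x_0) * prod_t pi(u_t | x_t) * p(x_{t+1} | x_t, u_t)."""
    prob = P0[states[0]]
    for t, u in enumerate(actions):
        x, x_next = states[t], states[t + 1]
        prob *= policy[x][u] * DYN[x][u][x_next]
    return prob

def total_mass(policy, horizon=2):
    """Sum the trajectory probability over all trajectories of a given horizon."""
    total = 0.0
    for states in product(range(2), repeat=horizon + 1):
        for actions in product(range(2), repeat=horizon):
            total += trajectory_prob(states, actions, policy)
    return total

print(total_mass(STOCHASTIC), total_mass(DETERMINISTIC))
```

Summing over all trajectories gives a total mass of one for both policies, which serves as a basic sanity check of the factorization.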


2.6 Behavior Descriptors


In imitation learning, it is essential to quantify the behavior and mea- 
sure the difference between the expert’s behavior and the learner’s be- 
havior. For this purpose, we need to consider “what should be matched 
between the expert and the learner?” In the following, we list descrip- 
tors of behavior, which can be matched between the expert and the 
learner in imitation learning. 


2.6.1 State-action Distribution 


Given a dataset $\mathcal{D} = \{(x, u)\}$ that consists of state-control
input pairs, we can model the joint distribution of the state and the
control input $p(x, u)$ or the conditional distribution of the control
input given the state $p(u \mid x)$. Early imitation learning approaches [Chambers and
Michie, 1969, Widrow and Smith, 1964, Pomerleau, 1988] learned a 
policy by modeling state-action distributions using supervised learning 
methods. However, since a state-action distribution only describes the 
short term behavior, matching only the state-action distribution can 
lead to a mismatch with long term behavior. 


2.6.2 Trajectory Feature Expectation 


To match the behavior between the expert and the learner over a long 
horizon, it is necessary to consider trajectory features. Since both a 
trajectory itself and observations of it are often stochastic and noisy, 
the expectation of trajectory features (an expectation has less noise 
compared to individual instances) is often used to describe the behavior 
of the expert and the learner. The expectation of the trajectory features 
with respect to the learner’s policy is given by 


\[ \mathbb{E}_{\pi}[\phi(\tau)] = \int p(\tau)\, \phi(\tau)\, d\tau, \tag{2.14} \]

where $p(\tau)$ is the trajectory distribution induced by the learner’s
policy and $\phi(\tau)$ is the feature vector of the trajectory $\tau$. The
expectation of the trajectory $\mathbb{E}[\tau]$ can be interpreted as a
special case of the feature expectation. When a dataset of trajectories
$\mathcal{D} = \{\tau_i^{demo}\}_{i=1}^{N}$ is available, the expectation of
the trajectory feature under the demonstrating policy $\pi^E$ can be
approximated as

\[ \mathbb{E}_{\pi^E}[\phi(\tau)] \approx \frac{1}{N} \sum_{i=1}^{N} \phi(\tau_i^{demo}). \tag{2.15} \]

Feature expectation matching appears both in behavior
cloning [Ijspeert et al., 2002a, Osa et al., 2014] and inverse
reinforcement learning [Abbeel and Ng, 2004, Ratliff et al., 2006b,
Ziebart et al., 2008].
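
The sample average in (2.15) is straightforward to compute. The Python sketch below assumes one-dimensional trajectories stored as lists of positions and a hypothetical two-dimensional feature vector (final position and path length); both choices are ours and serve only as an illustration.

```python
def features(trajectory):
    """Hypothetical feature vector phi(tau): final position and total path length."""
    path_length = sum(abs(b - a) for a, b in zip(trajectory, trajectory[1:]))
    return [trajectory[-1], path_length]

def feature_expectation(trajectories):
    """Empirical feature expectation (1/N) * sum_i phi(tau_i)."""
    n = len(trajectories)
    feats = [features(tau) for tau in trajectories]
    return [sum(f[k] for f in feats) / n for k in range(len(feats[0]))]

# Three made-up demonstrations of a reaching motion toward x = 1.
demos = [[0.0, 0.5, 1.0], [0.0, 1.2, 1.1], [0.0, 0.3, 0.9]]
mu_demo = feature_expectation(demos)
print(mu_demo)
```

A learner's feature expectation can be estimated in the same way from its own rollouts, and the difference between the two vectors is the quantity that feature matching methods drive toward zero.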


2.6.3 Trajectory Feature Distribution 


A distribution over trajectory features $p(\phi(\tau))$ is also often used
for matching the behavior between the expert and the learner. We can
match not only the first order moment (mean) of the distribution but
also higher order moments. The trajectory distribution $p(\tau)$ can be
considered as a special case of the feature distribution. Behavior cloning
methods such as [Paraschos et al., 2013, Englert et al., 2013] and inverse 
reinforcement learning methods such as [Arenz et al., 2016] use feature 
distributions. 


2.7 Information Theoretic Understanding of Feature 
Matching in Imitation Learning 


As we discussed in § 1.6, imitation learning can be formulated as a prob- 
lem of finding a policy that minimizes the difference between demon- 
strated and learned behavior. For this purpose, many imitation learning 
methods perform a “projection” of demonstrated behavior into a pa- 
rameterized policy space. Projecting demonstrations onto a manifold 
of a parameterized policy requires considering the relationship between 
the distribution of the demonstrations and the distribution of the pa- 
rameterized policy. Information theory provides a principled way of 
assessing this relationship. 

[Figure 2.4 diagram labels: data manifold; policy model manifold;
M-projection $\arg\min D_{\mathrm{KL}}(q(\tau) \,\|\, p(\tau \mid w))$;
I-projection $\arg\min D_{\mathrm{KL}}(p(\tau \mid w) \,\|\, q(\tau))$.]

Figure 2.4: Illustration of M- and I-projections from the data manifold onto the
policy model manifold. The solutions of the M- and I-projections are different
since the KL divergence is not symmetric.


2.7.1 Information Theoretic Understanding of Imitation 
Learning Algorithms for Trajectory Learning 


Consider a trajectory distribution $p(\tau \mid w)$ induced by a policy
$\pi$ with a parameter vector $w$. Supervised learning methods often obtain
a solution based on the maximum likelihood of the given training data.
As is well known [Bishop, 2006], maximizing the (causal) likelihood is
equivalent to minimizing the KL divergence

\[ D_{\mathrm{KL}}(q(\tau) \,\|\, p(\tau \mid w)) = \int q(\tau) \ln \frac{q(\tau)}{p(\tau \mid w)}\, d\tau, \tag{2.16} \]

where $q(\tau)$ is the empirical distribution over trajectories induced by
the expert’s policy and $\tau$ is a trajectory. This minimization can be
interpreted as a projection from the data manifold to the policy model
manifold [Amari, 2016]. On the other hand, as the KL divergence is not
symmetric, minimizing $D_{\mathrm{KL}}(q(\tau) \,\|\, p(\tau \mid w))$ is not
equivalent to minimizing

\[ D_{\mathrm{KL}}(p(\tau \mid w) \,\|\, q(\tau)) = \int p(\tau \mid w) \ln \frac{p(\tau \mid w)}{q(\tau)}\, d\tau, \tag{2.17} \]

which represents the projection from the policy model manifold to the
data manifold. We discuss a few more details of minimizing the different
projections in the following.

First, we consider imitation learning methods for trajectory learning
based on the M-projection defined in (1.4). The goal of imitation
learning in this case is to learn a parameter vector $w$ such that the
KL divergence of the M-projection is minimized, i.e.,

\[ w^* = \arg\min_{w} D_{\mathrm{KL}}(q(\tau) \,\|\, p(\tau \mid w)). \tag{2.18} \]

The resulting objective function can also be written as

\begin{align}
\mathcal{L}_{\mathrm{ML}} = D_{\mathrm{KL}}(q(\tau) \,\|\, p(\tau \mid w)) &= \int q(\tau) \ln \frac{q(\tau)}{p(\tau \mid w)}\, d\tau \tag{2.19} \\
&= \mathbb{E}_q[\ln q(\tau)] - \mathbb{E}_q[\ln p(\tau \mid w)], \tag{2.20}
\end{align}

where $\mathbb{E}_q[\cdot]$ is the expectation with respect to $q(\tau)$
[Bishop, 2006, Sugiyama, 2015]. The expectations $\mathbb{E}_q[\cdot]$ in
(2.20) can be estimated using the demonstrated trajectories drawn from
$q(\tau)$. Since the first term in (2.20) is independent of $w$,
$D_{\mathrm{KL}}(q(\tau) \,\|\, p(\tau \mid w))$ can be minimized by
maximizing the expected log likelihood $\mathbb{E}_q[\ln p(\tau \mid w)]$.
Hence, imitation learning based on simple supervised learning can be seen
as a special case of computing the M-projection, as these algorithms
essentially perform a likelihood maximization. Examples of such algorithms
are the least-squares solution for DMPs, expectation maximization (EM)
for ProMPs, and EM for SEDS, which minimize
$D_{\mathrm{KL}}(q(\tau) \,\|\, p(\tau \mid w))$ with different
parameterizations [Ijspeert et al., 2013, Paraschos et al.,
2013, Khansari-Zadeh and Billard, 2011].
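
The equivalence between maximizing the expected log likelihood and computing the M-projection can be checked numerically on a toy problem. The sketch below assumes a discrete empirical distribution over three trajectories and a hypothetical one-parameter model family (a binomial with two trials); neither is taken from the methods cited above. A grid search over the parameter is used purely for transparency.

```python
import math

q = [0.2, 0.3, 0.5]  # assumed empirical distribution over three trajectories

def model(w):
    """Hypothetical one-parameter family p(tau | w): binomial with n = 2."""
    return [(1 - w) ** 2, 2 * w * (1 - w), w ** 2]

def kl(a, b):
    """Discrete KL divergence D_KL(a || b)."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def expected_log_likelihood(w):
    """E_q[ln p(tau | w)], the second term of (2.20) without its sign."""
    return sum(qi * math.log(pi) for qi, pi in zip(q, model(w)))

grid = [i / 100 for i in range(1, 100)]
w_mproj = min(grid, key=lambda w: kl(q, model(w)))   # M-projection
w_ml = max(grid, key=expected_log_likelihood)        # maximum likelihood
print(w_mproj, w_ml)
```

Both searches select the same grid point because the two objectives differ only by the term $\mathbb{E}_q[\ln q(\tau)]$, which does not depend on $w$.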

It is informative to note that there is a close relation between the
maximum likelihood solution and the solution obtained from the principle
of maximum entropy. Consider, for instance, average feature constraints

\[ \mathbb{E}_p[\phi(\tau)] = \hat{\phi}. \tag{2.21} \]

If, subject to this feature matching constraint, we choose the
distribution with maximum entropy, we recover the exponential family
parametrization of $p(\tau \mid w)$ [Amari, 2016]:

\[ p(\tau \mid w) = \frac{\exp\left(w^\top \phi(\tau)\right)}{Z}. \tag{2.22} \]

Substituting the form of $p(\tau \mid w)$ in (2.22) into the original
maximum entropy problem and ignoring terms which do not depend on the
parameters $w$, the resulting dual objective function (or equivalently
the one in (2.20)) is

\[ \mathcal{L}_M = \mathbb{E}_q[w^\top \phi(\tau)] - \ln Z. \tag{2.23} \]

Differentiating (2.23) w.r.t. $w$ yields the following gradient, which can
be used for optimization of the parameters:

\begin{align}
\frac{d\mathcal{L}_M}{dw} &= \mathbb{E}_q[\phi(\tau)] - \frac{1}{Z} \int \exp\left(w^\top \phi(\tau)\right) \phi(\tau)\, d\tau \tag{2.24} \\
&= \mathbb{E}_q[\phi(\tau)] - \mathbb{E}_p[\phi(\tau)]. \tag{2.25}
\end{align}

Note that setting the gradient to 0 in order to obtain the optimum
yields the optimality condition required to hold in the primal, namely
that feature expectations match:

\[ \mathbb{E}_p[\phi(\tau)] = \mathbb{E}_q[\phi(\tau)]. \tag{2.26} \]


From (2.26), we can conclude that maximum likelihood on an assumed
exponential family form is also a solution to finding the maximum
entropy distribution (2.22) which respects the average feature
constraint (2.26). The latter viewpoint allows us to reason about, for
instance, cost function matching in inverse reinforcement learning and
to automatically construct an appropriate form for policies.

This observation is called the maximum likelihood / maximum
entropy duality [Dudík and Schapire, 2006]. Furthermore, as the
M-projection yields the same solution as maximizing the likelihood, we
can conclude that the M-projection solution for an exponential family
of trajectory distributions is equivalent to the maximum entropy one.
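
The gradient in (2.25) suggests a simple fitting procedure: increase $w$ along the difference between the demonstrated and the model feature expectations until they match. The Python sketch below assumes a finite set of four candidate trajectories with scalar features and a made-up target expectation, so that the partition function $Z$ is a finite sum; the step size and iteration count are also assumptions.

```python
import math

PHI = [[0.0], [1.0], [2.0], [3.0]]  # phi(tau) for four candidate trajectories (assumed)
TARGET = [2.0]                      # E_q[phi(tau)], assumed to come from demonstrations

def model_distribution(w):
    """Exponential family p(tau | w) = exp(w^T phi(tau)) / Z over the finite set."""
    scores = [math.exp(sum(wk * fk for wk, fk in zip(w, phi))) for phi in PHI]
    z = sum(scores)
    return [s / z for s in scores]

def model_feature_expectation(w):
    """E_p[phi(tau)] under the current model."""
    p = model_distribution(w)
    return [sum(pi * phi[k] for pi, phi in zip(p, PHI)) for k in range(len(w))]

def fit(steps=200, lr=0.5):
    """Gradient ascent on E_q[w^T phi] - ln Z using the gradient E_q[phi] - E_p[phi]."""
    w = [0.0] * len(TARGET)
    for _ in range(steps):
        grad = [t - m for t, m in zip(TARGET, model_feature_expectation(w))]
        w = [wk + lr * gk for wk, gk in zip(w, grad)]
    return w

w_star = fit()
print(w_star, model_feature_expectation(w_star))
```

At convergence the model feature expectation agrees with the target, which is the optimality condition (2.26).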

It is often useful to consider the maximum entropy principle in its
regularized form [Ziebart et al., 2013, Boularias et al., 2011], that is,
instead of finding a maximum entropy distribution we want to find
a distribution with the minimal KL divergence relative to a “prior”
distribution $p_0(\tau)$ while matching the features of the demonstrator,
that is,

\begin{align}
&\arg\min_{p} D_{\mathrm{KL}}(p(\tau) \,\|\, p_0(\tau)) \tag{2.27} \\
&\text{s.t. } \mathbb{E}_p[\phi(\tau)] = \mathbb{E}_q[\phi(\tau)]. \tag{2.28}
\end{align}

The solution to this problem can again be obtained by the method of
Lagrangian multipliers:

\[ p(\tau \mid w) = \frac{p_0(\tau) \exp\left(w^\top \phi(\tau)\right)}{Z}, \tag{2.29} \]

with $p_0(\tau) = \exp\left(w_0^\top \phi(\tau)\right) / Z_0$.

A particularly elegant result due to [Dudík and Schapire, 2006]
demonstrates that if we have bounds on the accuracy with which
our feature matching constraints hold, the resulting maximum entropy
problem gives rise via duality to a regularized likelihood equivalent to
a maximum a posteriori estimate with a prior on the dual parameters.

It is, however, important to note that such maximum entropy
principles should not be confused with the I-projection, which computes

\[ \arg\min_{w} D_{\mathrm{KL}}(p(\tau \mid w) \,\|\, q(\tau)). \]

Here, the data is induced via the distribution $q(\tau)$ on the right-hand
side of the KL divergence, while in the maximum entropy principle, the
data is induced by the feature averages and $p_0(\tau)$ on the right-hand
side of the KL divergence is just a prior. The I-projection does not match
features of the demonstrator. Whenever an algorithm matches average
features, it is an instance of an M-projection based algorithm. Since
$\ln q(\tau)$ is unknown and hard to evaluate in practice, it is
challenging to perform the I-projection in the context of imitation
learning. To the best of our knowledge, there is no existing imitation
learning method that performs the I-projection exactly.

As we have seen from our discussion above, many imitation learning 
methods can be seen as related to the M-projection and to the principle 
of maximum entropy. This is true for most model-free and model-based 
methods. Model-free methods based on standard supervised learning 
[Ijspeert et al., 2013, Khansari-Zadeh and Billard, 2011] do not require 
access to the system dynamics or iterative data acquisition. 

In contrast, model-based imitation learning methods often try to
match features of the state distribution so as to satisfy
$\mathbb{E}_p[\phi(\tau)] = \mathbb{E}_q[\phi(\tau)]$. In order to do so, we
either need access to the system dynamics [Ziebart et al., 2008,
Ziebart, 2010] or require iterative data acquisition [Boularias et al.,
2011].


2.7.2 Information Theoretic Understanding of Imitation
Learning Algorithms in Action-State Space


In this section, we look at imitation learning in action-state
space from an information theoretic point of view. In a Markov model,
the probability distribution over trajectories $p(\tau)$ can be decomposed
as a sequence of states and actions

\[ p(\tau) = p(x_0) \prod_{t=0}^{T} p(x_{t+1} \mid x_t, u_t)\, \pi(u_t \mid x_t), \tag{2.30} \]

where the policy $\pi(u_t \mid x_t)$ maps from the states of the system to
the control inputs. Let us consider the trajectory distribution $p(\tau)$
induced by the learner’s policy and the trajectory distribution $q(\tau)$
induced by the expert’s policy. If the embodiments of the learner and the
expert are equivalent and stationary, that is,
$q(x_{t+1} \mid x_t, u_t) = p(x_{t+1} \mid x_t, u_t) = p(x_t \mid x_{t-1}, u_{t-1})$,
the relation of $p(\tau)$ and $q(\tau)$ is given by

\[ \frac{p(\tau)}{q(\tau)} = \frac{\prod_{t=0}^{T} \pi^L(u_t \mid x_t)}{\prod_{t=0}^{T} \pi^E(u_t \mid x_t)}, \tag{2.31} \]

where $\pi^L$ is the learner’s policy and $\pi^E$ is the expert’s policy.
In this case, imitation learning methods based on the M-projection minimize

\begin{align}
D_{\mathrm{KL}}(q(\tau) \,\|\, p(\tau)) &= \int q(\tau) \ln \frac{q(\tau)}{p(\tau)}\, d\tau \tag{2.32} \\
&= \int q(x, u) \ln \frac{\pi^E(u \mid x)}{\pi^L(u \mid x)}\, dx\, du \tag{2.33} \\
&= \mathbb{E}_{q(x,u)}\left[\ln \pi^E(u \mid x) - \ln \pi^L(u \mid x)\right], \tag{2.34}
\end{align}

where $q(x, u)$ is the state-action distribution induced by the trajectory
distribution $q(\tau)$ of the expert. Since $\mathbb{E}_q[\cdot]$ can be
approximated using the trajectories drawn from $q(\tau)$, minimization of
the KL divergence in (2.34) can be solved using only the demonstrated
trajectories. Early studies on imitation learning such as [Widrow and
Smith, 1964, Pomerleau, 1988] are based on this kind of supervised
learning. However, these methods may not work well in many applications,
as indicated by [Ross et al., 2011, Bagnell, 2015].
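
Because the first term in (2.34) does not depend on the learner, minimizing (2.34) amounts to minimizing the average negative log likelihood of the demonstrated actions under the learner's policy. For a tabular policy this has a closed-form solution, namely the empirical action frequencies per state. The Python sketch below assumes a small made-up driving dataset with discrete states and actions; all names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Made-up demonstrations as (state, action) pairs.
demos = [("left", "steer_right"), ("left", "steer_right"), ("left", "straight"),
         ("right", "steer_left"), ("right", "steer_left"), ("center", "straight")]
ACTIONS = ["steer_left", "straight", "steer_right"]

def fit_policy(pairs):
    """Maximum likelihood tabular policy: action frequencies for each visited state."""
    counts = defaultdict(Counter)
    for x, u in pairs:
        counts[x][u] += 1
    return {x: {u: c[u] / sum(c.values()) for u in ACTIONS} for x, c in counts.items()}

def negative_log_likelihood(policy, pairs):
    """Sample estimate of -E_q[ln pi^L(u | x)] on the demonstrated pairs."""
    return -sum(math.log(policy[x][u]) for x, u in pairs) / len(pairs)

learned = fit_policy(demos)
uniform = {x: {u: 1 / len(ACTIONS) for u in ACTIONS} for x in learned}
print(negative_log_likelihood(learned, demos), negative_log_likelihood(uniform, demos))
```

Note that the objective is evaluated only on states visited by the expert, which is one way to see why such a policy offers no guarantee in states the learner reaches on its own.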

In contrast, we could try to base imitation learning techniques
on an I-projection [Amari, 2016] that minimizes

\begin{align}
D_{\mathrm{KL}}(p(\tau) \,\|\, q(\tau)) &= \int p(x, u) \ln \frac{\pi^L(u \mid x)}{\pi^E(u \mid x)}\, dx\, du \tag{2.35} \\
&= \mathbb{E}_p\left[\ln \pi^L(u \mid x) - \ln \pi^E(u \mid x)\right]. \tag{2.36}
\end{align}

However, it is hard to minimize $D_{\mathrm{KL}}(p(\tau) \,\|\, q(\tau))$ in
practice as we cannot evaluate $\ln \pi^E(u \mid x)$, and to the best of our
knowledge there is no prior work on imitation learning methods that
minimize $D_{\mathrm{KL}}(p(\tau) \,\|\, q(\tau))$. Exploring imitation
learning methods based on the I-projection will be an interesting research
direction. Intuitively, the solution obtained by DAGGER [Ross et al., 2011]
may result in a smaller KL divergence under the I-projection than the one
obtained by ordinary supervised learning, as DAGGER attempts to achieve
good performance under the learner’s own data distribution.


3 


Behavioral Cloning 


In this chapter, we review behavioral cloning (BC) methods. BC meth- 
ods learn a direct mapping from states/contexts to trajectories/actions 
without recovering the reward function. Behavioral cloning can be an 
efficient way to reproduce the demonstrated behavior when such di- 
rect mapping is the most parsimonious way to represent the desired 
behavior. 

We start by reviewing model-free BC methods and continue by 
reviewing model-based BC methods, which leverage information about 
system dynamics. 


3.1 Problem Statement 


A controller for a robotic system often has a hierarchical structure. 
Figure 3.1 shows a control diagram of a robotic system with imitation 
learning. The upper-level controller plans the desired trajectory based 
on a given context and/or observations. Meanwhile, the lower-level con- 
troller determines the control input to achieve the desired trajectory. 
The main target of imitation learning for robotic systems is to learn 
these controllers. 

Figure 3.1: Control diagram of a robotic system with imitation learning. An
expert demonstrates the desired behavior, generating a dataset $\mathcal{D}$.
Based on $\mathcal{D}$ and observations about the current context and system
state, an upper-level controller generates the desired trajectory $\tau^d$. A
lower-level feedback controller tries to follow $\tau^d$ using observation
feedback to generate a control $u$, which causes a change to the system state
$x$ and a new observation. In imitation learning, the controllers are
tuned to imitate the expert demonstrations.


When learning trajectories, the aim of imitation learning is to learn
a policy that generates a desired trajectory

\[ \tau^d = \pi(s) \tag{3.1} \]

for a given context $s$. The context $s$ can be the initial state of the
robotic manipulator $x_0$ or the state of objects relevant to a given task.
In action-state space learning, the aim is to learn a policy that generates
a control input $u_t$ for a given state $x_t$ and/or context $s$,

\[ u_t = \pi(x_t, s). \tag{3.2} \]


In imitation learning, we assume that a dataset of the expert’s
demonstrations is available. When learning trajectories, the dataset
usually consists of a set of trajectories and contexts
$\mathcal{D} = \{(\tau_i, s_i)\}_{i=1}^{N}$. In action-state space learning,
the dataset is given as a set of control inputs and states
$\mathcal{D} = \{(u_i, x_i)\}_{i=1}^{N}$. Using such datasets, a policy can be
learned as the direct mapping from the context to the trajectory or
from the state to the control input. This learning problem can be
formulated as a supervised learning problem in which a policy can be
obtained by solving a simple regression problem. We call this approach
“behavioral cloning”. Algorithm 1 abstracts the procedure of BC methods.
The first step of BC is to record a set of expert demonstrations
$\mathcal{D}$, which are usually given as a set of trajectories.
Thereafter, we need to select a policy representation $\pi_\theta$
appropriate for a given application. In addition, we need to select an
objective function $\mathcal{L}$ that represents the similarity between
the demonstrated behaviors and the learner’s policy. The policy parameters
$\theta$ are then optimized using the collected dataset of demonstrations.

Algorithm 1 Abstract of behavioral cloning

  Collect a set of trajectories demonstrated by the expert $\mathcal{D}$
  Select a policy representation $\pi_\theta$
  Select an objective function $\mathcal{L}$
  Optimize $\mathcal{L}$ w.r.t. the policy parameter $\theta$ using $\mathcal{D}$
  return optimized policy parameters $\theta$
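
The steps of Algorithm 1 can be sketched in a few lines of Python. The example below is only an illustration under simplifying assumptions of our own: a hypothetical noise-free expert $u = -2x$, a scalar linear policy $\pi_\theta(x) = \theta x$, and a quadratic objective, for which the optimal parameter has a closed form.

```python
def collect_demonstrations():
    """Step 1: record demonstrations from a hypothetical expert u = -2 x."""
    states = [-1.0, -0.5, 0.5, 1.0, 2.0]
    return [(x, -2.0 * x) for x in states]

def policy(theta, x):
    """Step 2: policy representation pi_theta(x) = theta * x."""
    return theta * x

def loss(theta, dataset):
    """Step 3: quadratic objective between demonstrated and predicted controls."""
    return sum((u - policy(theta, x)) ** 2 for x, u in dataset)

def optimize(dataset):
    """Step 4: closed-form least squares solution for the scalar linear policy."""
    return sum(x * u for x, u in dataset) / sum(x * x for x, _ in dataset)

dataset = collect_demonstrations()
theta = optimize(dataset)
print(theta, loss(theta, dataset))
```

With richer policy classes the closed-form step is replaced by an iterative optimizer, but the overall structure of the procedure stays the same.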


3.2 Design Choices for Behavioral Cloning 


In addition to the design choices we described in Chapter 2, we list 
here some essential design choices for BC methods. 


1. What surrogate loss function should be used to represent
   the difference in demonstrated and produced behavior?
   BC methods require a surrogate loss function which quantifies
   the difference between the demonstrated behavior and the
   learned policy. The choice of the surrogate loss function strongly
   influences how the policy is trained, and we need to select an
   appropriate surrogate loss function to achieve efficient learning.


2. What regression method should be used to represent the
   policy? To obtain satisfactory system performance, it is essential
   to select an appropriate regression method. The regression model
   should be sufficiently expressive to represent the desired behavior
   but simple enough to allow for efficient training of the model. For
   efficient learning, the regression method should be chosen together
   with the surrogate loss function.


3.2.1 Choice of Surrogate Loss Functions for Behavioral 
Cloning 


We discuss some options for surrogate loss functions in this section. 


3.2.1.1 Quadratic Loss Function 


The quadratic loss function is the most common choice for the loss
function. Given two vectors $x_1$ and $x_2$, a quadratic loss function is
given by

\[ \ell_{\mathrm{quad}}(x_1, x_2) = (x_1 - x_2)^\top (x_1 - x_2). \tag{3.3} \]

For example, the difference between the state $x^L$ induced by the
learner’s policy and the state $x^{demo}$ demonstrated by the expert can
be formulated as

\[ \ell_{\mathrm{quad}}(x^L, x^{demo}) = (x^L - x^{demo})^\top (x^L - x^{demo}). \tag{3.4} \]

The quadratic loss function is also called the $\ell_2$-loss function, and
regression that minimizes the quadratic loss function is often called least
squares (LS) regression or $\ell_2$-loss minimization [Sugiyama, 2015].

Minimizing the quadratic loss function is closely related to maximizing
the expected log likelihood under the Gaussian distribution assumption.
Let us consider the regression function $f_\theta(x)$ parameterized
by $\theta$. Suppose that the target variable $y$ follows the equation

\[ y = f_\theta(x) + \epsilon, \tag{3.5} \]

where $\epsilon$ is drawn from the Gaussian distribution as
$\epsilon \sim \mathcal{N}(0, \sigma^2)$. In this model, the probability
distribution of $y$ is given by

\[ p(y \mid x, \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - f_\theta(x))^2}{2\sigma^2}\right). \tag{3.6} \]

Finding the model $f_\theta(x)$ that maximizes the expected log likelihood
can be formulated as

\begin{align}
\arg\max_{\theta} \mathbb{E}[\log p] &= \arg\max_{\theta} \mathbb{E}\left[\log \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - f_\theta(x))^2}{2\sigma^2}\right)\right] \tag{3.7} \\
&= \arg\min_{\theta} \mathbb{E}\left[(y - f_\theta(x))^2\right] \tag{3.8} \\
&\approx \arg\min_{\theta} \sum_{i} (y_i - f_\theta(x_i))^2. \tag{3.9}
\end{align}

Therefore, minimizing the quadratic loss function is equivalent to
maximizing the expected log likelihood under the Gaussian distribution. BC
methods such as DMP [Schaal et al., 2004, Ijspeert et al., 2013] and
ProMP [Paraschos et al., 2013, Maeda et al., 2016] learn a trajectory
representation by minimizing quadratic loss functions.
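
The equivalence between least squares and Gaussian maximum likelihood can be verified numerically. The Python sketch below assumes a scalar linear model $f_\theta(x) = \theta x$, four made-up data points, and a fixed noise standard deviation of 0.5; it compares a grid search over $\theta$ under both objectives.

```python
import math

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.9]  # made-up noisy observations of y = x
SIGMA = 0.5                # assumed, fixed noise standard deviation

def sum_squared_error(theta):
    return sum((y - theta * x) ** 2 for x, y in zip(xs, ys))

def gaussian_log_likelihood(theta):
    """Sum of ln N(y_i | theta * x_i, SIGMA^2) over the data."""
    const = -0.5 * math.log(2 * math.pi * SIGMA ** 2)
    return sum(const - (y - theta * x) ** 2 / (2 * SIGMA ** 2) for x, y in zip(xs, ys))

thetas = [0.5 + i / 1000 for i in range(1001)]
theta_ls = min(thetas, key=sum_squared_error)
theta_ml = max(thetas, key=gaussian_log_likelihood)
theta_closed_form = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
print(theta_ls, theta_ml, theta_closed_form)
```

Both searches return the same grid point, and it lies next to the closed-form least squares solution. Changing `SIGMA` rescales the log likelihood but does not move its maximizer.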

Additionally, one can also use a weighted quadratic loss function

\[ \ell_{\mathrm{wquad}}(x_1, x_2, W) = (x_1 - x_2)^\top W (x_1 - x_2) \tag{3.10} \]

when an appropriate weight matrix $W$ is known. For example, the
Mahalanobis distance [Mahalanobis, 1936] given by

\[ \ell_{\mathrm{maha}}(x_1, x_2) = (x_1 - x_2)^\top \Sigma^{-1} (x_1 - x_2), \tag{3.11} \]

where $\Sigma$ is the covariance matrix of a distribution of interest, is
often used in the literature [Rozo et al., 2016, Osa et al., 2017a].


3.2.1.2 $\ell_1$-Loss Function


The $\ell_1$-loss function is often employed for regression. The
$\ell_1$-loss function is given by

\[ \ell_{\mathrm{abs}}(x_1, x_2) = \sum_{i} |x_{1,i} - x_{2,i}|, \tag{3.12} \]

where $x_{1,i}$ and $x_{2,i}$ are the $i$th elements of the vectors $x_1$
and $x_2$, respectively. The $\ell_1$-loss function is also called the
absolute loss function, and regression with minimization of the
$\ell_1$-loss is called least absolute deviations regression or
$\ell_1$-loss minimization [Sugiyama, 2015]. Usually,
$\ell_1$-loss minimization is more robust to outliers than $\ell_2$-loss
minimization. This robustness can be attributed to the fact that
$\ell_1$-loss minimization gives the median of the training samples, while
$\ell_2$-loss minimization gives the mean of the training samples. In
$\ell_2$-loss minimization, a few large outliers can influence the mean
significantly, while in $\ell_1$-loss minimization the median can be largely
unaffected by a few large outliers. In addition, unlike $\ell_2$-loss
minimization, $\ell_1$-loss minimization results in a sparse solution,
which can be computationally efficient. Although there are not many prior
studies on using $\ell_1$-loss minimization in imitation learning, the
discussed properties of the $\ell_1$-loss could be beneficial.
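
The robustness argument can be made concrete with a constant predictor fitted to a handful of samples, one of which is an outlier. The Python sketch below uses made-up numbers and a coarse grid search over the constant; it is meant only to illustrate the mean/median behavior described above.

```python
import statistics

samples = [1.0, 1.1, 0.9, 1.0, 10.0]  # the last value is an outlier

def l1_loss(c):
    return sum(abs(s - c) for s in samples)

def l2_loss(c):
    return sum((s - c) ** 2 for s in samples)

candidates = [i / 10 for i in range(0, 101)]  # constants 0.0, 0.1, ..., 10.0
best_l1 = min(candidates, key=l1_loss)
best_l2 = min(candidates, key=l2_loss)
print(best_l1, statistics.median(samples))
print(best_l2, statistics.mean(samples))
```

The $\ell_1$ minimizer coincides with the median and stays near the bulk of the data, whereas the $\ell_2$ minimizer coincides with the mean and is pulled toward the outlier.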


3.2.1.3 Log Loss Function 


The log loss function is defined by

\[ \ell_{\log}(q, p) = -\sum_{i} q_i \log p_i, \tag{3.13} \]

where $q$ is the true probability and $p$ is the predicted probability. In
binary classification, the log loss function is given by

\[ \ell_{\log}(q, p) = -\left(q \log p + (1 - q) \log(1 - p)\right). \tag{3.14} \]

Since the log loss is equivalent to the cross entropy, the log loss is also
called the cross-entropy loss [Sugiyama, 2015].

In binary classification (in imitation learning, classification can be
used to learn a discrete control policy from expert demonstrations),
minimizing the log loss function is equivalent to maximizing the log
likelihood in logistic regression. In more detail, suppose that we want to
learn a binary classifier where the probability follows the Bernoulli
distribution

\[ p(y = 1 \mid x, \theta) = f_\theta(x), \quad p(y = 0 \mid x, \theta) = 1 - f_\theta(x), \tag{3.15} \]

which can be more compactly written as

\[ p(y \mid x, \theta) = \left(f_\theta(x)\right)^{y} \left(1 - f_\theta(x)\right)^{1 - y}. \tag{3.16} \]

Maximizing the expected log likelihood $\mathbb{E}[\log p]$ of Bernoulli
distributed data then follows as

\begin{align}
\max_{\theta} \mathbb{E}[\log p] &= \max_{\theta} \mathbb{E}\left[y \log f_\theta(x) + (1 - y) \log(1 - f_\theta(x))\right] \nonumber \\
&\approx \max_{\theta} \frac{1}{N} \sum_{i} \left(y_i \log f_\theta(x_i) + (1 - y_i) \log(1 - f_\theta(x_i))\right) \nonumber \\
&\Leftrightarrow \min_{\theta} \ell_{\log}(y, f_\theta(x)). \tag{3.17}
\end{align}

Therefore, in binary classification, minimizing the log loss function
is equivalent to maximizing the expected log likelihood under the
Bernoulli distribution.
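
A direct implementation makes the link between the binary log loss and the Bernoulli likelihood explicit. The Python sketch below uses our own function names and made-up labels and predictions; it assumes predicted probabilities strictly between 0 and 1.

```python
import math

def log_loss(q, p):
    """Binary log loss for a true label q in {0, 1} and a predicted probability p."""
    return -(q * math.log(p) + (1 - q) * math.log(1 - p))

def bernoulli_likelihood(y, p):
    """Bernoulli probability p^y * (1 - p)^(1 - y) of the label y."""
    return (p ** y) * ((1 - p) ** (1 - y))

def mean_log_loss(labels, predictions):
    return sum(log_loss(y, p) for y, p in zip(labels, predictions)) / len(labels)

labels = [1, 0, 1, 1]
confident = [0.9, 0.1, 0.8, 0.95]
hesitant = [0.6, 0.4, 0.55, 0.6]
print(mean_log_loss(labels, confident), mean_log_loss(labels, hesitant))
```

The log loss of a single sample is the negative logarithm of its Bernoulli likelihood, so predictions that assign higher probability to the observed labels receive a lower loss.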


3.2.1.4 Hinge Loss Function 


Hinge loss is a loss function often used for maximum margin optimization
in classifiers such as support vector machines (SVMs) [Cortes and
Vapnik, 1995]. Given two scalar variables $x_1$ and $x_2$, the hinge loss
can be defined as

\[ \ell_{\mathrm{hinge}}(x_1, x_2) = \max(0, 1 - x_1 x_2). \tag{3.18} \]

Intuitively, the hinge loss assigns zero cost if a classification is
correct with a sufficient margin ($x_1 x_2 \geq 1$):
$\ell_{\mathrm{hinge}}(x_1, x_2) = 0$. Otherwise, the cost grows linearly
with the margin violation $1 - x_1 x_2$. This also explains the term
“hinge”; in a visual illustration of the cost function one can imagine a
hinge at $x_1 x_2 = 1$. While the hinge loss is not differentiable at the
“hinge” location $x_1 x_2 = 1$, it can still be optimized in practice, for
example with subgradient methods. Moreover, since the hinge loss function
is convex, it can be optimized efficiently with various convex optimizers.
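
The shape of the hinge loss is easy to inspect in code. In the Python sketch below, the first argument is interpreted as a label in $\{-1, +1\}$ and the second as a classifier score; this interpretation and the function names are our own assumptions. The second function returns one valid subgradient with respect to the score.

```python
def hinge_loss(label, score):
    """Hinge loss max(0, 1 - label * score) for a label in {-1, +1}."""
    return max(0.0, 1.0 - label * score)

def hinge_subgradient(label, score):
    """A subgradient of the hinge loss with respect to the score."""
    return -label if label * score < 1.0 else 0.0

for score in (-0.5, 0.5, 1.0, 2.0):
    print(score, hinge_loss(1, score), hinge_subgradient(1, score))
```

Scores on the correct side with a margin of at least one incur no cost, and everywhere else the loss decreases linearly as the margin grows.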


3.2.1.5 Kullback-Leibler Divergence 


In the field of information geometry, the Kullback-Leibler (KL) divergence
is used to quantify the difference between two probability distributions
[Kullback and Leibler, 1951]:

\[ D_{\mathrm{KL}}(p(x) \,\|\, q(x)) = \int p(x) \ln \frac{p(x)}{q(x)}\, dx. \tag{3.19} \]

Since the KL divergence measures the difference between two probability
distributions, it is useful when learning stochastic policies.
Please note that the KL divergence is not symmetric; that is,
$D_{\mathrm{KL}}(p(x) \,\|\, q(x)) = D_{\mathrm{KL}}(q(x) \,\|\, p(x))$ does
not hold in general. BC methods such as [Englert et al., 2013] use the KL
divergence as the loss function.
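
The asymmetry of the KL divergence is easy to observe numerically. The Python sketch below evaluates the discrete analogue of (3.19) in both directions for two made-up distributions over three outcomes.

```python
import math

def kl_divergence(p, q):
    """Discrete KL divergence sum_i p_i * ln(p_i / q_i); terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.4, 0.1]
q = [0.1, 0.3, 0.6]
forward = kl_divergence(p, q)
reverse = kl_divergence(q, p)
print(forward, reverse)
```

The two directions give different values, so the choice of direction is part of the design of the loss function.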


3.2.2 Choice of Regression Methods for Behavioral Cloning 


When applying behavioral cloning, an appropriate regression method 
must be chosen. Table 3.1 lists regression methods found in the liter- 
ature. As discussed by Bishop [2006], one must choose a model that 
has appropriate complexity. Simple models which can be trained using 
linear regression are easy to train, but may not be sufficiently infor- 
mative. Complex models such as neural networks can represent highly 
nonlinear mappings. However, training such complex models requires a 
large amount of training data. In addition, it is important to note that 
imitation learning cannot be addressed as simple supervised learning 
in many applications as we discussed in §2.7. We discuss an approach 
for reducing imitation learning to supervised learning with interaction 
in §3.4.3. 


3.3 Model-Free and Model-Based Behavioral Cloning 
Methods 


As discussed in §2.3, BC methods can be categorized into model-free 
and model-based methods. Table 3.2 shows advantages and disadvan- 
tages of both model-free and model-based BC methods. 

Model-free BC methods learn a policy that reproduces the expert’s
behavior without learning or estimating the system dynamics and without
recovering the reward function. Since model-free BC methods do not require
learning of system dynamics, they often do not require
iterative learning and are relatively simple to implement compared to
model-based BC methods. However, in trajectory learning, model-free
BC methods do not ensure that the resulting trajectory is feasible in a
given system. For this reason, it is hard to apply model-free methods to
underactuated systems in which the set of reachable states is limited.

Contrary to model-free BC methods, model-based BC methods
learn a policy using information about the system dynamics. By
learning forward dynamics, it is possible to plan a feasible trajectory close to
the expert’s behavior even if a robotic system is underactuated. How- 
ever, in many applications, learning a forward model is a non-trivial 
problem. In addition, model-based BC methods often require iterative 
learning, which is usually time-consuming compared with learning with 
model-free BC methods. 


3.4  Model-Free Behavioral Cloning Methods in Action- 
State space 


In this section we discuss behavioral cloning methods in action-state
space. Although it seems that simple supervised learning can work in
imitation learning, such a naive approach does not work in many
applications. We will identify potential problems encountered when
applying supervised learning to imitation learning and discuss how we
can alleviate these problems.


Table 3.1: Regression methods in model-free behavioral cloning for both trajectory
and action-state space learning. The output trajectory in trajectory learning consists
of a long high dimensional sequence of variables while in action-state space learning
the output is a single action. Therefore, some methods such as look-up tables have
not been applied to trajectory learning. For modeling uncertainty in demonstrations,
regression methods need to have explicit support for variance. Gaussian model,
GMM and GPR methods model uncertainty explicitly.

Learning type   Method             References
--------------  -----------------  ----------------------------------------
Trajectory      Gaussian Model     [Paraschos et al., 2013, Maeda et al.,
Learning                           2016]
                GMR                [Calinon and Billard, 2009,
                                   Gribovskaya et al., 2011,
                                   Khansari-Zadeh and Billard, 2014,
                                   Calinon, 2016]
                LWR                [Schaal and Atkeson, 1998, Mülling
                                   et al., 2013, Osa et al., 2017a]
                LWPR               [Vijayakumar et al., 2005]
                GPR                [Osa et al., 2017b]

Action-State    Look-Up Table      [Chambers and Michie, 1969]
Space           Linear Regression  [Widrow and Smith, 1964]
                Neural Network     [Pomerleau, 1988, LeCun et al., 2006,
                                   Stadie et al., 2017, Duan et al., 2017]
                Decision Tree      [Sammut et al., 1992]
                LWR                [Atkeson and Schaal, 1997]
                LWPR               [Vijayakumar and Schaal, 2000]


3.4.1 Model-Free Behavioral Cloning as Supervised Learn- 
ing 


Early studies on imitation learning such as [Widrow and Smith, 1964,
Chambers and Michie, 1969, Pomerleau, 1988] employed supervised
learning methods for imitation learning in action-state space. Among
such early studies, in the seminal work ALVINN (Autonomous Land
Vehicle In a Neural Network), Pomerleau [1988] developed an autonomous
driving system using imitation learning. Pomerleau [1988] collected
pairs of camera images and steering angles and trained a neural
network that modeled a direct mapping from camera images to steering
angles. However, this simple approach can fail in practice, with the
autonomous car quickly driving off the road. As Bagnell [2015] indicated,
learning errors cascade in sequential decision making, which makes the
learner encounter unknown states that the expert never encounters in
her/his successful demonstrations. Pomerleau [1988] wrote: “If the
network is not presented with sufficient variability in its training
exemplars to cover the conditions it is likely to encounter when it takes over
driving from the human operator, it will not develop a sufficiently robust
representation and will perform poorly. In addition, the network must
not solely be shown examples of accurate driving, but also how to recover
(i.e. return to the road center) once a mistake has been made.” That
is, the distribution of the states that the learner encounters is different
from the distribution of the states in the given demonstration data.
Supervised learning is usually based on the assumption that training data
samples are independent and identically distributed. However, this
assumption is often violated in an imitation learning problem, especially
when a policy for sequential decision making needs to be learned. To
address this issue, Ross and Bagnell [2010], Ross et al. [2011] proposed
an approach which reduces imitation learning to supervised learning
with interaction, which we discuss in § 3.4.3.


Table 3.2: A main choice when doing behavior cloning is whether to use a
model-based or a model-free method. Model-free methods can directly learn a policy from
data without learning a dynamics model. Direct learning also usually means that
the learning algorithm does not need to iterate between trajectory and behavior
generation. However, model-free methods are hard to apply to underactuated systems
since, without a model, predicting desired behavior is hard. Model-based methods
may work in underactuated systems, but learning the actual model can be difficult
in many cases.

               Model-free                        Model-based
-------------  --------------------------------  ------------------------------
Advantages     A policy can usually be learned   Applicable to underactuated
               without iterative learning.       systems.

Disadvantages  Hard to apply to underactuated    Model learning can be very
               systems.                          difficult.
               Hard to predict future states.    An iterative learning process
                                                 is often required.


3.4.2 Imitation as Supervised Learning with Neural Net- 
works 


Using neural networks for learning has attracted great interest in
various fields. Supervised learning of neural networks can also be used for
imitation learning: the desired neural network policy can be learned
from the dataset generated/demonstrated by the expert. In this
section, we briefly highlight some recent imitation learning successes with
neural networks.


3.4.2.1 Recent Successes of Imitation Learning with Neural
Networks


Recently, using neural networks for imitation learning has shown
impressive results in certain applications such as learning to play Go
[Silver et al., 2016], generating handwriting [Chung et al., 2015],
generating natural language [Wen et al., 2015], or generating image
captions [Karpathy and Fei-Fei, 2015]. Moreover, supervised learning of
neural networks has been used as a building block, for example, for
learning the policy or the cost function in inverse reinforcement
learning (please see §4.4.6 for more details).

Figure 3.2: The game of Go is played on a 19x19 board. Even though the total
number of possible board configurations exceeds $10^{170}$ and thus the training data
cannot cover all possible plays, the simple imitation learning approach in [Silver et al.,
2016] was able to learn a competitive policy from demonstrations and improve the
policy using self-play. [Figure from
https://commons.wikimedia.org/wiki/File:Tuchola_026.jpg. CC license.]


Supervised imitation learning can be challenging when demonstra- 
tions do not cover the states that the learner encounters. For some 
applications, such as board games where the state space is known in 
advance, demonstrations could in principle be made to cover the state 
space. However, for example in the game of Go shown in Figure 3.2, 
the set of possible states is too large to cover completely and the su- 
pervised training approach needs to be able to generalize from training 
data. AlphaGo, an algorithm which was able to beat a human Go
master [Silver et al., 2016], succeeded in learning a competitive Go policy
using supervised imitation learning and then improving the policy using
reinforcement learning.

AlphaGo trains a value network, which approximates the value 
function to predict the expected outcome of the game, and a policy 
network, which outputs actions using a representation of the image in- 
put of the board. The policy network is initialized by supervision using 
a large set of expert demonstrations, in total 30 million positions from 
the KGS Go Server. The value and policy networks are improved using 
data collected through self-play. AlphaGo selects actions by evaluating 
them with the policy and value networks.

The trained policy is a 13-layer deep neural network with alternating
convolutional layers and rectifier nonlinearity layers, and the output
is a soft-max layer resulting in a probability distribution over actions.
The neural network receives as input a representation of the board
state. For supervised training of the policy, AlphaGo uses stochastic
gradient ascent to maximize the likelihood of expert demonstrations
w.r.t. parameters $\theta$:

$\Delta\theta \propto \frac{\partial \log \pi_\theta(u_t | x_t)}{\partial \theta}$,

where $\Delta\theta$ is the change in parameters, $u_t$ is the expert
action, and $x_t$ is the state. AlphaGo also utilizes a smaller, less
accurate, but faster policy for predicting the expected outcome of actions.
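To make the likelihood-maximization update concrete, the following minimal sketch performs one gradient ascent step on $\log \pi_\theta(u_t | x_t)$. It assumes a small linear soft-max policy as a stand-in for the deep convolutional network of AlphaGo; the feature vector, the number of actions, and the step size are illustrative choices of ours.

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def log_prob(theta, x, u):
    """log pi_theta(u | x) for a linear soft-max policy (an assumed toy model)."""
    return float(np.log(softmax(theta @ x)[u]))

def grad_log_prob(theta, x, u):
    """Gradient of log pi_theta(u | x) with respect to theta."""
    probs = softmax(theta @ x)
    one_hot = np.zeros_like(probs)
    one_hot[u] = 1.0
    # d log softmax / d logits = one_hot - probs; the chain rule gives an outer product.
    return np.outer(one_hot - probs, x)

def ascent_step(theta, x, u, step_size=0.1):
    """One gradient ascent step: theta <- theta + step_size * grad."""
    return theta + step_size * grad_log_prob(theta, x, u)

theta = np.zeros((3, 4))               # 3 actions, 4 state features
x_t = np.array([1.0, 0.0, -1.0, 2.0])  # illustrative state features
u_t = 2                                # index of the expert action
before = log_prob(theta, x_t, u_t)
theta = ascent_step(theta, x_t, u_t)
after = log_prob(theta, x_t, u_t)
print(before, after)
```

After the step, the log-likelihood of the expert action at this state is higher than before, which is the effect that the update rule is designed to have on each demonstrated state-action pair.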


3.4.2.2 Learning with Recurrent Neural Networks 


In many applications, supervised learning of recurrent neural networks
has made imitation learning of complex time series predictions possible.
Wen et al. [2015] show how to generate human-like natural language
using a special form of the long short-term memory (LSTM) [Hochreiter
and Schmidhuber, 1997] network. Wen et al. [2015] train their system
using data collected from a spoken dialogue system. Table 3.3 shows an
example of natural language generated by the trained neural network.


Table 3.3: Natural language generated by the semantically controlled LSTM
(SC-LSTM) cell neural network proposed in [Wen et al., 2015]. The table shows an
example dialogue act and related natural language samples from [Wen et al., 2015].
The neural network generates natural language learned from human demonstrations.
The neural network is conditioned on the dialogue act which limits the generated
sentences to specific meanings.

Dialogue act:
inform(name="red door cafe", goodformeal="breakfast",
area="cathedral hill", kidsallowed="no")

Generated samples:

red door cafe is a good restaurant for breakfast in the area
of cathedral hill and does not allow children .

red door cafe is a good restaurant for breakfast in the cathedral hill
area and does not allow children .

red door cafe is a good restaurant for breakfast in the cathedral hill
area and does not allow kids .

red door cafe is good for breakfast and is in the area of cathedral hill
and does not allow children .

red door cafe does not allow kids and is in the cathedral hill area
and is good for breakfast .

As is common when designing neural network based systems, the 
neural network architecture in [Wen et al., 2015] is adapted to the 
task at hand. Moreover, neural network approaches need to take prob- 
lems such as vanishing gradients, co-adaptation, and overfitting into 
account. Vanishing gradients can be a problem especially for recurrent 
neural networks due to the high optimization depth. The neural net- 
work architecture in [Wen et al., 2015] includes skip connections [Graves 
et al., 2013] to soften vanishing gradients and Wen et al. [2015] utilize 
dropout [Srivastava et al., 2014], a technique which randomly deac- 
tivates connections in the neural network during training, to reduce 
co-adaptation and overfitting. 

Learning recurrent neural networks from demonstrations has been 
shown to work also for other kinds of data. Karpathy and Fei-Fei [2015] 
show how to learn to generate annotations for image regions from 
demonstrations. The approach of [Karpathy and Fei-Fei, 2015] learns 
from a combination of image and language data to generate natural lan- 
guage descriptions of images. Chung et al. [2015] show how to learn to 
generate handwriting and natural speech from demonstrations. Chung 
et al. [2015] propose a new type of recurrent neural network with hid- 
den random variables and argue that random variables are needed to 
model variability in data with complex correlations between different 
time steps, for example, in natural speech. 


3.4.3 Teacher-Student Interaction during Behavioral 
Cloning 


Although the goal of imitation learning is to learn a policy that
reproduces the expert’s behavior, any learned policy will inevitably make at
least occasional mistakes. As a result, small errors may cascade [Bagnell,
2015]: a small error at an early time step may lead the learner to a state
which deviates from expert demonstrations. Consequently, the learner
will make further mistakes, leading to poor performance.

This highlights a central difference between imitation learning and
the traditional setting of supervised learning, where we typically
assume the inputs to be independent and identically distributed
[Shalev-Shwartz and Ben-David, 2014]. In imitation learning, by contrast,
the features/states in a dataset of demonstrations are not
drawn from the distribution of the features which the learner will
encounter using its own policy. This means that the assumption of
independent and identically distributed (i.i.d.) data is often violated
in imitation learning. Crudely speaking, a policy for recovering from
mistakes needs to be learned, as suggested by Pomerleau [1988].

However, in even modest-scale imitation learning problems it is
infeasible to collect demonstrations under all possible situations, and
we must instead focus corrections on the most relevant scenarios. To this
end, a policy can be iteratively learned by alternating between policy
updates and requesting additional demonstrations for the current state
distribution [Ross and Bagnell, 2010, Ross et al., 2011, Bagnell, 2015].
We review methods that address this problem in the following section.


3.4.3.1 Reduction of Structured Prediction to Iterative 
Learning of Simple Classification 


The task of learning a function that maps inputs x to structured
outputs y (for example, parse trees, trajectories, matchings, etc. [Taskar,
2005]) is referred to as structured prediction [Tsochantaridis et al.,
2005, Bakır et al., 2007]. Problems of imitation learning can often
profitably be phrased as structured prediction [Ratliff et al., 2006b,a, 2009],
and this has led to the development of some techniques we cover extensively
in this survey in § 4.4.2.

Conversely, search-based structured prediction (SEARN), proposed
by Daumé III et al. [2009], is a seminal work that demonstrated that
one can also reduce structured prediction to a kind of imitation
learning. In particular, SEARN crafts a series of reductions from structured
prediction to simple classification. In SEARN, structured prediction is
formulated as a search process over the components $y_t$ of the structured
output $y$, where the $t$th decision is dependent on the preceding $t - 1$
decisions. Therefore, the training process of a classifier in SEARN is
dependent on the classifier itself.

SEARN learns a multiclass cost-sensitive classifier, e.g. [Zadrozny
et al., 2003], for each state in the dataset through an iterative process.
By performing the prediction using the current classifier $h$, SEARN
creates new cost-sensitive samples. These cost-sensitive samples are used
to learn a new classifier $h'$, which SEARN combines with the current
classifier $h$ in a stochastic manner. Daumé III et al. [2009] show that
the performance of SEARN is competitive with other methods such
as structured SVM [Tsochantaridis et al., 2005] and Conditional
Random Field [Lafferty et al., 2001], while often being tremendously faster
to learn. Modern, high-performance implementations of such
search-based structured prediction use online learning methods of DAGGER
[Ross et al., 2011, 2013], AggreVaTe [Ross and Bagnell, 2014, Sun et al.,
2017], or LOLS [Chang et al., 2015, Daumé III and Langford, 2015].

However, for each time step, simple implementations of these
search-based structured prediction methods require a state reset and an expert
demonstration. Such a reset is often infeasible in the physical world, and
even if possible, the expert may need to provide a prohibitively large
number of demonstrations. For these reasons, SEARN, AggreVaTe and
LOLS require substantial care to implement efficiently, using e.g.
bandit methods or value regression, and to deal with resets [Chang et al.,
2015].


3.4.3.2 Confidence-Based Approach 


Chernova and Veloso [2009] proposed a method that learns a policy by 
requesting additional expert demonstrations based on the confidence 
of a given state. In this method, the learner learns how to select the 
action from a finite set of action primitives by using classifiers that 
return selection confidence, e.g. Gaussian mixture models. When the 
confidence is lower than a threshold, additional expert demonstrations 
are requested. In addition, when the expert observes incorrect actions 
by the learner, the expert corrects the action and the corrected action is 
added to the training dataset. By requesting additional demonstrations, 
this method also tries to empirically learn a policy under the state 
distribution induced by the learner’s policy.

Algorithm 2 Confidence-based autonomy algorithm: confident execution
and corrective demonstration [Chernova and Veloso, 2009]

Input: Demonstration of the action-state pairs $D = \{(x_i, u_i)\}_{i=1}^{N}$,
    confidence threshold $c_0$
Initialize the policy $\pi^L$
repeat
    Observe the state $x$
    Compute the confidence $c(x)$
    Plan action $u^L$
    if $c(x) < c_0$ or corrective demonstration is necessary then
        Receive the demonstration data $D_{\mathrm{new}} = \{(x^{\mathrm{new}}, u^{\mathrm{new}})\}$
        Update the dataset $D \leftarrow D \cup D_{\mathrm{new}}$
        Update the policy $\pi^L$
    end if
until the task is learned
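The confident-execution branch of Algorithm 2 can be sketched as follows. This is a toy illustration under our own assumptions, not the implementation of Chernova and Veloso [2009]: a nearest-neighbor classifier with a distance-based confidence replaces the Gaussian mixture models, the expert is a hypothetical threshold rule on a one-dimensional state, and corrective demonstrations are omitted.

```python
import numpy as np

def expert_action(x):
    # Hypothetical expert: choose action 1 for positive states, else action 0.
    return int(x > 0.0)

class NearestNeighborPolicy:
    """Toy learner that returns an action together with a confidence value."""

    def __init__(self, states, actions):
        self.states = list(states)
        self.actions = list(actions)

    def predict(self, x):
        distances = np.abs(np.asarray(self.states) - x)
        nearest = int(np.argmin(distances))
        confidence = float(np.exp(-distances[nearest]))  # 1 at a demonstrated state
        return self.actions[nearest], confidence

    def add(self, x, u):
        # Storing the new pair is the "policy update" for this lazy learner.
        self.states.append(x)
        self.actions.append(u)

def confident_execution(policy, observed_states, threshold):
    """Run the loop once over a stream of states; return the number of expert queries."""
    queries = 0
    for x in observed_states:
        u_learner, confidence = policy.predict(x)
        if confidence < threshold:
            policy.add(x, expert_action(x))  # request a demonstration for this state
            queries += 1
        # Otherwise the learner would execute u_learner autonomously.
    return queries

rng = np.random.default_rng(0)
stream = rng.uniform(-3.0, 3.0, size=200)
policy = NearestNeighborPolicy(states=[-2.0, 2.0], actions=[0, 1])
first_pass_queries = confident_execution(policy, stream, threshold=0.8)
second_pass_queries = confident_execution(policy, stream, threshold=0.8)
print(first_pass_queries, second_pass_queries)
```

On the second pass over the same states no further demonstrations are requested, since every state is now close to a stored demonstration. This reflects the intent of the method: expert effort is spent only where the learner is uncertain.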


3.4.3.3 Data Aggregation Approach: DAGGER 


Ross et al. [2011] proposed a meta-algorithm called DAGGER, which
attempts to collect expert demonstrations under the state distribution
induced by the learned policy. It can be seen most naturally as an
on-policy approach¹ [Sutton and Barto, 1998] to imitation learning: the
expert provides the correct actions to take, but the input distribution
of examples comes from the learner’s own behavior.

Figure 3.3 shows an overview of the DAGGER approach to imitation
learning. The simplest form of DAGGER proceeds as follows. At
the first iteration, the policy is initialized by behavioral cloning of the
expert demonstrations, resulting in policy $\pi^L_1$. Subsequently, the policy
is used to collect a dataset of trajectories, and those newly obtained
trajectories and the demonstrated trajectories are aggregated into a
dataset $D$, which is used to train a policy $\pi^L_2$. At iteration $n$, a
policy $\pi^L_n$ is used to collect more trajectories, and those trajectories are
added to the dataset $D$. The next policy $\pi^L_{n+1}$ is trained so that
$\pi^L_{n+1}$ mimics the expert on the whole dataset $D$. To leverage the
presence of the expert, DAGGER queries partial expert demonstrations $\pi^E$
in the learning phase, and the policy
$\pi_i = \beta_i \pi^E + (1 - \beta_i) \pi^L_i$ (a stochastic mixing of expert
and learner) is used to collect the next dataset.
In other words, partial expert demonstrations are requested under the
states induced by the learned policy $\pi^L_i$. Thus, DAGGER learns a policy
from the expert demonstrations under the state distribution induced
by the learned policy. Algorithm 3 shows the details of the general
DAGGER algorithm.

¹The first use of the phrase on-policy, which nicely evokes the closely
related approaches and issues in Reinforcement Learning, is due to [Laskey et al.,
2017].

Figure 3.3: An overview of DAGGER from [Bagnell, 2015]. In each iteration,
DAGGER generates new examples using the current policy with corrections (labels)
provided by the experts, adds the new demonstrations to a demonstration dataset
and computes a new policy to optimize performance in aggregate over that dataset.
The figure illustrates a single iteration of DAGGER. The basic version of DAGGER
initializes the demonstration dataset from a single set of expert demonstrations and
then interleaves policy optimization and data generation to grow the dataset. More
generally, there is nothing special about aggregating data: any method, like gradient
descent or weighted majority, that is sufficiently stable in its policy generation and
does well on average over the iterations (or more broadly, any no-regret algorithm run
over each iteration's dataset) will achieve the same guarantees, and may be strongly
preferred for computational reasons.

Algorithm 3 DAGGER [Ross et al., 2011]

Input: initial dataset of demonstrations $D = \{(x, u)\}$, $\{\beta_i\}$ such
    that $\frac{1}{N} \sum_{i=1}^{N} \beta_i \to 0$
Initialize: $\pi^L_1$
for $i = 1$ to $N$ do
    Let $\pi_i = \beta_i \pi^E + (1 - \beta_i) \pi^L_i$.
    Sample trajectories $\tau = [x_0, u_0, \ldots, x_T, u_T]$ using $\pi_i$
    Get dataset $D_i$ of visited states by $\pi_i$ and actions given by expert.
    Aggregate datasets: $D \leftarrow D \cup D_i$
    Train the policy $\pi^L_{i+1}$ on $D$.
end for
return best $\pi^L_i$ on validation.
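The loop of Algorithm 3 can be sketched in a few lines. The example below is a toy illustration under our own assumptions: a scalar state with additive dynamics, a hypothetical proportional-controller expert, a least-squares linear learner, and the mixing schedule $\beta_i = 0.5^{i-1}$. For brevity it returns the last policy instead of selecting the best one on validation data.

```python
import numpy as np

def expert_policy(x):
    # Hypothetical expert: a proportional controller that drives x toward 0.
    return -0.5 * x

def fit_linear_policy(states, actions):
    # Least-squares fit of u = w * x; stands in for any supervised learner.
    xs = np.asarray(states)
    us = np.asarray(actions)
    w = float(xs @ us) / float(xs @ xs)
    return lambda x: w * x

def rollout(policy, x0, horizon, rng):
    # Toy dynamics (an assumption): x_{t+1} = x_t + u_t + small noise.
    visited, x = [], x0
    for _ in range(horizon):
        visited.append(x)
        x = x + policy(x) + 0.05 * rng.standard_normal()
    return visited

def dagger(n_iters=5, horizon=20, seed=0):
    rng = np.random.default_rng(seed)
    states = rollout(expert_policy, 1.0, horizon, rng)  # initial demonstrations
    actions = [expert_policy(x) for x in states]
    learner = fit_linear_policy(states, actions)        # behavioral cloning
    for i in range(1, n_iters + 1):
        beta = 0.5 ** (i - 1)  # beta_1 = 1, then decaying toward 0

        def mixed(x, learner=learner, beta=beta):
            # Stochastic mixture: expert with probability beta, else the learner.
            return expert_policy(x) if rng.random() < beta else learner(x)

        visited = rollout(mixed, rng.uniform(-2.0, 2.0), horizon, rng)
        states += visited
        actions += [expert_policy(x) for x in visited]  # expert labels visited states
        learner = fit_linear_policy(states, actions)    # retrain on aggregated data
    return learner, states, actions

learner, states, actions = dagger()
print(len(states), learner(1.0))
```

The key design point is that the states come from rollouts of the mixture policy while the action labels always come from the expert. In this toy problem the expert is itself linear, so the learner recovers it exactly; the benefit of aggregation is more visible when the learner's model class cannot represent the expert everywhere.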


By collecting expert demonstrations in the states that the
learner encounters, DAGGER alleviates the problem that the state
distribution induced by the learner’s policy is different from the state
distribution in the initial demonstration data. This approach significantly
reduces the size of the training dataset necessary to obtain satisfactory
performance [Ross et al., 2011], and often achieves much better
performance even asymptotically. DAGGER can be interpreted as a reduction
of imitation learning to supervised learning with interaction [Bagnell,
2015].

Crucially, the DAGGER approach is not limited to naive
aggregation of all previous data: in fact, any algorithm (like gradient
descent, some variants of Newton’s method, the exponentiated gradient
approach, etc.) that enjoys the property of being no-regret can be used
to learn iteratively on each newly collected dataset, and achieve the
related formal guarantees. In practice, for instance, training complex
policies with substantial training data is often based on online learning
approaches like gradient descent.² We can think crudely of no-regret
algorithms as the class of methods whose predictions are asymptotically
good on average over the datasets they are presented, and are
sufficiently stable between iterations [Hazan, 2016].

²Note it is not technically correct to refer to these as Stochastic Gradient Descent
(SGD) methods because the data being generated is not independent and identically
distributed. Instead, the more general analysis of Online Gradient Descent [Hazan,
2016] is required.

Data as Demonstrator: Venkatraman et al. [2015] extended
DAGGER and proposed a framework called Data as Demonstrator
(DaD), where the problem of multi-step prediction is formulated as
imitation learning. Prediction errors cascade over time in multi-step
prediction, as in the case of learning a policy, and this prediction error
can also be reduced by a data aggregation approach. Recent work
shows the efficacy of DaD in control problems [Venkatraman et al.,
2016].


3.5 Model-Free Behavioral Cloning for Learning Trajec- 
tories 


In this section, we review approaches to learn trajectories from demon- 
strations. In robotic manipulation, trajectory planning is one of the 
most significant problems. If we assume that the system is (nearly) 
fully actuated and that a low-level controller to achieve the desired 
state is available, a trajectory for a given task can be learned without 
explicitly estimating the system dynamics. Since many commercialized
robotic manipulators have such low-level controllers, this
model-free BC approach has been dominant in imitation learning
research for robotic manipulator trajectory planning. Next, we show how
the choice of the trajectory representation influences trajectory learning
and how the representation needs to fit the application at hand.


3.5.1 Trajectory Representation 


In order to learn trajectories we first need to define how to represent 
a trajectory. The choice of trajectory representation determines the 
parameterized space where demonstrated trajectories are projected. 
Therefore, it is essential to figure out the most parsimonious repre- 
sentation for a given application. 

For planning a desired trajectory, we need a policy that generates a
trajectory $\tau \in \mathcal{T}$. The trajectory is given by a sequence of
desired states and/or control inputs based on a given context
$s \in \mathcal{S}$. Given a set of demonstrated trajectories
$D = \{(s_i, \tau_i)\}_{i=1}^{N}$, we can use a supervised
learning method to learn a policy which directly maps from contexts
to trajectories


$\pi : s \mapsto \tau$. (3.20)


For this purpose, we can use various regression methods developed in 
the field of machine learning. For example, Calinon et al. [2007] em- 
ployed Gaussian mixture regression to model a mapping from time 
to states, and Osa et al. [2017b] used Gaussian Process regression for 
learning a mapping from contexts to trajectories. For learning such poli- 
cies the choice of methods is usually not limited to specific regression 
methods, and we can also employ various machine learning techniques 
such as dimensionality reduction [Sugiyama, 2015] in order to alleviate 
the challenges of trajectory learning. 
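As a minimal illustration of learning the mapping in (3.20) by regression, the sketch below fits a ridge-regularized linear map from a scalar context to a discretized trajectory. The setup is entirely our own assumption: the context is a goal position, the demonstrations are synthetic minimum-jerk reaches, and ridge regression stands in for the Gaussian mixture or Gaussian process regression used in the cited work.

```python
import numpy as np

def make_demo(goal, n_steps=50):
    # Hypothetical demonstration: a smooth reach from 0 to the goal
    # following a minimum-jerk time profile.
    t = np.linspace(0.0, 1.0, n_steps)
    return goal * (10 * t**3 - 15 * t**4 + 6 * t**5)

def fit_context_to_trajectory(contexts, trajectories, ridge=1e-6):
    # Ridge regression from [context, 1] to all trajectory points at once.
    S = np.column_stack([contexts, np.ones(len(contexts))])
    gram = S.T @ S + ridge * np.eye(S.shape[1])
    return np.linalg.solve(gram, S.T @ trajectories)

def predict_trajectory(W, context):
    # Apply the learned map to a new context.
    return np.array([context, 1.0]) @ W

contexts = np.array([0.2, 0.5, 0.8, 1.0])                   # demonstrated goals
trajectories = np.stack([make_demo(g) for g in contexts])   # shape (4, 50)
W = fit_context_to_trajectory(contexts, trajectories)
predicted = predict_trajectory(W, 0.65)                      # unseen context
print(predicted[0], predicted[-1])
```

The predicted trajectory for the unseen context starts near zero and ends near the requested goal. Nothing in this regression enforces feasibility or convergence, however, which is the limitation that motivates the structured representations discussed next.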

However, when planning a desired trajectory for a robotic system 
we need to ensure that the planned trajectory is physically feasible and 
a naive application of regression may not be the best choice. It is often 
necessary to impose some constraints on the planned trajectory, such as 
smooth convergence to the goal state. Such constraints may be implic- 
itly satisfied when regression methods are used to learn a policy, but 
it is often convenient to use a policy that explicitly satisfies some con- 
straints. Dynamic movement primitives (DMPs) [Schaal et al., 2004, 
Ijspeert et al., 2013] and the stable estimator of dynamical systems 
(SEDS) approach [Khansari-Zadeh and Billard, 2011] are representa- 
tions that explicitly satisfy the condition of smooth convergence to the 
goal state. For learning these policies, regression methods are used in 
specific ways such that the desired constraints are satisfied. In the fol- 
lowing, we discuss the details of different trajectory representations. 


3.5.1.1 Keyframe/Via-Point Based Approaches 


One obvious way to represent trajectories is as a sequence of keyframes 
or via-points. In the field of computer graphics, the term “keyframe” 
is used to express important states which are needed for accomplish- 
ing a given task [Parent, 2002]. In a keyframe-based approach, a task 
trajectory is represented as a sequence of keyframes. In robotic motion
planning literature, the term “via-point” is used similarly to the term
keyframe [Pastor et al., 2009, Paraschos et al., 2013, Zucker et al., 2013]. 
Instead of using the terms “keyframe” and “via-point”, several articles 
describe a trajectory as consisting of a sequence of discrete states [Lee 
and Nakamura, 2009, Takano and Nakamura, 2015]. 

A keyframe-based trajectory representation appears in several im- 
itation learning applications. Nakaoka et al. [2007] developed a hu- 
manoid system that learns dancing from human expert demonstration 
using a keyframe-based approach. The motion of the human expert 
was captured by a 3D motion tracking system, and the keyframes 
were subsequently extracted. By modifying the keyframes according 
to the dynamics of the humanoid, the humanoid was able to perform 
the demonstrated dance properly. Okamoto et al. [2014] developed a 
system that can perform a dance synchronously to music with different
rhythms by learning the correspondence between the music and
the dancing motion. 

Trajectories can be represented using discrete states. For discrete 
states one natural dynamics and observation model representation is 
the hidden Markov model which we will discuss next. 


3.5.1.2 Representation with Hidden Markov Models 


A hidden Markov model (HMM) is often used to model the proba- 
bilistic transition between discrete states [Inamura et al., 2004, Kulić 
et al., 2008, Lee and Nakamura, 2009, Takano and Nakamura, 2015]. 
A discrete HMM consists of a finite set of latent states X, a finite 
set of observation labels Y, a state transition matrix $A = \{a_{ij}\}$, an
output probability matrix $B = \{b_{ij}\}$, and an initial distribution vector
$d_i$. When an HMM is used to represent motion, the latent state often
represents the phase of the motion, and the observation represents the 
kinematic state of the system. Given a set of observation sequences 
and the set of states, A and B can be obtained by the Baum-Welch 
algorithm, which is a variant of the Expectation-Maximization (EM) 
algorithm. Once A and B are trained, a motion sequence can be esti- 
mated for a given initial state. 

One of the benefits of an HMM representation is the ability to
recognize the current system state based on the learned probabilistic
model. HMMs have been used in classical speech recognition [Rabiner,
1989], and motion recognition can be performed in the same manner
using HMMs [Inamura et al., 2004, Takano and Nakamura, 2015]. Given
an HMM $\lambda = (A, B)$ and an observation sequence $Y$, the likelihood of
observing the given sequence, $p(Y | \lambda)$, can be computed. Therefore, the
observed motion can be recognized as


$\lambda^* = \arg\max_{\lambda} p(Y | \lambda)$. (3.21)


In the framework in [Inamura et al., 2004], HMMs are used to represent
primitive motions. The library of primitive motions is represented by
a set of HMMs, and the motion is recognized based on the likelihood as
in (3.21). This framework is extended to clustering and segmentation
of demonstrated trajectories in [Kulić et al., 2008, Lee and Nakamura,
2009, Lee et al., 2010, Takano and Nakamura, 2015, 2016].
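Likelihood-based recognition as in (3.21) can be sketched with the scaled forward algorithm. The two motion models below, their names, and all probabilities are made-up illustrative values; the initial state distribution is passed explicitly as `init`.

```python
import numpy as np

def log_likelihood(obs, A, B, init):
    """log p(Y | lambda) via the scaled forward algorithm.

    A[i, j] is the probability of moving from latent state i to j,
    B[i, k] is the probability of emitting observation label k in state i,
    init[i] is the probability of starting in state i.
    """
    alpha = init * B[:, obs[0]]
    log_p = 0.0
    for y in obs[1:]:
        scale = alpha.sum()
        log_p += np.log(scale)                   # accumulate the scaling factors
        alpha = ((alpha / scale) @ A) * B[:, y]  # predict, then weight by emission
    return float(log_p + np.log(alpha.sum()))

def recognize(obs, models):
    # Pick the model under which the observed sequence is most likely.
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))

# Two hypothetical left-to-right motion primitives over two observation labels.
A = np.array([[0.9, 0.1], [0.0, 1.0]])
init = np.array([1.0, 0.0])
models = {
    "low_then_high": (A, np.array([[0.9, 0.1], [0.1, 0.9]]), init),
    "high_then_low": (A, np.array([[0.1, 0.9], [0.9, 0.1]]), init),
}

observed = [0, 0, 0, 1, 1, 1]
best = recognize(observed, models)
print(best)
```

Working with log-likelihoods and rescaling the forward variable at every step avoids numerical underflow for long observation sequences. In practice the model parameters would be trained from demonstrations, for example with the Baum-Welch algorithm, instead of being set by hand.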

On the other hand, one of the drawbacks of the HMM representation 
is discreteness. Recognition with HMMs works well when the number 
of states is relatively low [Kulić et al., 2008]. However, HMMs with
too few states may not be capable of reproducing a motion sequence. 
In robotic applications, HMMs are often used to represent the discrete 
high-level state of the system, assuming a low-level controller to achieve 
the desired state is available. However, it is non-trivial to plan smooth 
and feasible trajectories in many robotic systems. 

To overcome the discreteness of HMMs, recent work uses other 
techniques in combination with HMMs, such as state specific Gaus- 
sian models [Calinon et al., 2010] to represent continuous values such 
as velocity, spatial position, or force [Racca et al., 2016]. Recent work 
also uses Hidden Semi-Markov Models (HSMM) [Yu, 2010] to model 
more complex state duration distributions [Calinon et al., 2011]. The 
work by Rozo et al. [2016] employs an LQR controller to address the 
problem of optimizing a trajectory retrieved from an HSMM. Addition- 
ally, Takano and Nakamura [2017] recently proposed an HMM-based 
method for planning joint torques to control the contact force.

Figure 3.4: Schematic illustration of DMP. DMPs represent the demonstrated
motion as a combination of a nonlinear force term and an attractor force term. Blue 
and red points represent the start and goal positions, respectively. Suppose that 
the trajectory shown in (a) is given as a demonstrated trajectory. The nonlinear 
force term along the trajectory, which is dependent on the phase of the motion, is 
shown as orange vectors, and in (b) green vectors represent the attractor force term, 
which is stationary and dependent on the state of the system. The dynamics of the 
demonstrated motion is learned as a sum of these terms shown in (c). 


3.5.1.3 Dynamic Movement Primitives 


Dynamic Movement Primitives (DMPs) were introduced by Ijspeert 
et al. [2002a,b], Schaal et al. [2004], Ijspeert et al. [2013]. DMPs are 
motivated by differential equations of well-defined attractor dynamics. 
Representation with DMPs ensures the smoothness and continuity of 
the trajectory. In addition, a DMP is able to represent nonlinear move- 
ments without losing the stability of the behavior. Figure 3.4 shows a 
schematic illustration of DMPs. DMPs represent demonstrated motion 
with a combination of a nonlinear force term and an attractor force 
term. The nonlinear force term enables expressing complex motions. 
Since the nonlinear force term decays in time, the goal attractor force 
term dominates at the end of the motion, and a path planned by a 
DMP smoothly converges to the goal state. 

We describe the details of DMPs in the following. In a DMP, the 
demonstrated motion with one degree of freedom (DoF) is modeled as a 
spring-damper system 

$$\tau^2 \ddot{x} = \alpha_x \left( \beta_x (g - x) - \tau \dot{x} \right) + f, \tag{3.22}$$

where $x$ is the state of the system, $f$ is the forcing function that 
determines the nonlinear behavior, and $\alpha_x$ and $\beta_x$ are constants that 
determine the damping and spring behavior, respectively. $\tau$ is a 
constant that determines the temporal behavior, and $g$ denotes the goal 
state. In the example shown in Figure 3.4, the forcing function $f$ and 
the goal attractor term $\alpha_x \beta_x (g - x)$ are visualized in Figure 3.4(b). 
In imitation learning with DMPs, we can often assume that the final 
state $x^{\text{demo}}(T)$ of the demonstrated motion is the goal state, 
$g = x^{\text{demo}}(T)$. 

One significant feature of DMPs is time modulation by using a 
phase variable. By choosing the appropriate form of the basis function 
of the forcing function and the phase variable, DMPs can represent var- 
ious movements with different execution speeds [Ijspeert et al., 2013]. 
Let us denote by $z$ a phase variable. For a striking movement, one can 
introduce the phase variable that follows the first-order linear dynamics 

$$\tau \dot{z} = -\alpha_z z, \tag{3.23}$$

where $\alpha_z$ is a constant. Ijspeert et al. [2013] called this equation the 
canonical system because it models the generic behavior of the system. 
In this case, the phase variable $z$ is given as a function of time $t$ by 

$$z = z_0 \exp\left( -\frac{\alpha_z}{\tau} t \right), \tag{3.24}$$

where $z_0$ is the initial value of $z$. The phase variable $z$ exponentially 
converges to zero from an arbitrary initial state. Typically, the phase 
variable $z$ is used as $z \in [0, 1]$ for a striking movement. 

The forcing function that models the nonlinear behavior is learned 
as a function of the phase variable $z$. Using a Gaussian basis function 
with this phase variable $z$, the forcing function can be formulated as 

$$f(z) = (g - x_0) \sum_{i=1}^{M} \psi_i(z)\, w_i\, z, \tag{3.25}$$

where $x_0$ denotes the initial position and $M$ the number of basis 
functions. The Gaussian basis function $\psi_i(z)$ is given by 

$$\psi_i(z) = \frac{\exp\left( -h_i (z - c_i)^2 \right)}{\sum_{j=1}^{M} \exp\left( -h_j (z - c_j)^2 \right)}, \tag{3.26}$$

where $h_i$ and $c_i$ are constants that determine the width and centers of 
the basis functions, respectively. This system represents stable attractor 
dynamics with nonlinear behavior. DMPs can also be used to represent 
rhythmic movements by using periodic basis functions [Schaal et al., 
2004, Ijspeert et al., 2013]. 

Algorithm 4 Learning dynamic movement primitives [Schaal et al., 
2004, Ijspeert et al., 2013] 

Input: demonstrated trajectory $\boldsymbol{\tau}^{\text{demo}}$, parameters $\alpha_x$, $\beta_x$, $\tau$, $\alpha_z$, $w_z$ 
Choose a system of a phase variable $z$, e.g., (3.23) 
Choose a basis function $\psi$ of the forcing function $f$ 
Compute the forcing function at each time step using $\boldsymbol{\tau}^{\text{demo}}$ with (3.27) 
Find a weight vector $\boldsymbol{w}$ that minimizes $L_{\text{DMP}}$ in (3.28) using a least-squares solution (3.29) 


If we assume that a demonstrated trajectory $\boldsymbol{\tau}^{\text{demo}}$ is given, the 
weight vector $\boldsymbol{w}$ can be learned as a supervised learning problem [Schaal 
et al., 2004, Ijspeert et al., 2013]. From the given trajectory, we compute 
the position, velocity and acceleration at each time step. To obtain the 
weight parameters in a DMP, we compute the target value of the forcing 
function from the given trajectory as 

$$f_{\text{target}}(t) = \tau^2 \ddot{x}^{\text{demo}}(t) - \alpha_x \left( \beta_x \left( g - x^{\text{demo}}(t) \right) - \tau \dot{x}^{\text{demo}}(t) \right), \tag{3.27}$$

where $x^{\text{demo}}(t)$, $\dot{x}^{\text{demo}}(t)$, $\ddot{x}^{\text{demo}}(t)$ are the position, velocity and 
acceleration at time $t$, respectively. Subsequently, we can find the weight 
vector $\boldsymbol{w}$ that minimizes the sum of the squared error 


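$$L_{\text{DMP}} = \sum_{t=0}^{T} \left( f_{\text{target}}(t) - \xi(t)\, \boldsymbol{\psi}(t)^{\top} \boldsymbol{w} \right)^2, \tag{3.28}$$

where $\xi(t) = (g - x_0) z(t)$ for the discrete system and $\xi(t) = 1$ for the 
rhythmic system, $\boldsymbol{\psi}(t) = [\psi_1(t), \ldots, \psi_M(t)]^{\top}$ collects the basis 
functions evaluated at time $t$, and the entry of $\Psi$ is computed as 
$\Psi_{ij} = \psi_i(t_j)$ with (3.25). The weight vector $\boldsymbol{w}$ can be obtained by a 
least-squares solution 

$$\boldsymbol{w} = \left( \Psi \Psi^{\top} \right)^{-1} \Psi F. \tag{3.29}$$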
For the attractor dynamics in (3.25), $F$ is given by 

$$F = \left[ \frac{f_{\text{target}}(0)}{(g - x_0) z(0)}, \; \ldots, \; \frac{f_{\text{target}}(t)}{(g - x_0) z(t)}, \; \ldots, \; \frac{f_{\text{target}}(T)}{(g - x_0) z(T)} \right]^{\top}, \tag{3.30}$$

where $T$ is the number of the total time steps. Algorithm 4 summarizes 
the procedure for learning DMPs. Since DMPs are primarily designed 
for learning a motion for a single degree of freedom, a separate DMP 
needs to be learned for each dimension when learning motions with 
multiple dimensions. 
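
To make the learning procedure concrete, the following is a minimal sketch of fitting and replaying a one-DoF DMP. It is an illustration under stated assumptions rather than a reference implementation: the function names `learn_dmp` and `rollout_dmp`, the gains ($\alpha_x = 25$, $\beta_x = \alpha_x / 4$, $\alpha_z = 3$), the placement and width heuristic of the basis functions, and the explicit Euler integration are all our choices. The weights are obtained by solving the least-squares problem (3.28) directly with `numpy.linalg.lstsq`.

```python
import numpy as np

def learn_dmp(x_demo, dt, n_basis=20, alpha_x=25.0, beta_x=6.25, alpha_z=3.0):
    """Fit forcing-term weights of a 1-DoF DMP to one demonstration."""
    x_demo = np.asarray(x_demo, dtype=float)
    n = len(x_demo)
    tau = (n - 1) * dt                           # time constant: demo duration (assumption)
    t = np.arange(n) * dt
    xd = np.gradient(x_demo, dt)                 # finite-difference velocity
    xdd = np.gradient(xd, dt)                    # finite-difference acceleration
    x0, g = x_demo[0], x_demo[-1]                # goal = final demonstrated state
    z = np.exp(-alpha_z * t / tau)               # phase, Eq. (3.24) with z0 = 1
    f_target = tau**2 * xdd - alpha_x * (beta_x * (g - x_demo) - tau * xd)  # Eq. (3.27)
    c = np.exp(-alpha_z * np.linspace(0.0, 1.0, n_basis))  # centers, evenly spaced in time
    h = 1.0 / np.gradient(c) ** 2                # heuristic widths (assumption)
    psi = np.exp(-h * (z[:, None] - c) ** 2)
    psi /= psi.sum(axis=1, keepdims=True)        # normalized basis, Eq. (3.26)
    regressors = ((g - x0) * z)[:, None] * psi   # xi(t) * psi_i(z(t))
    w = np.linalg.lstsq(regressors, f_target, rcond=None)[0]  # minimizes Eq. (3.28)
    return dict(w=w, c=c, h=h, x0=x0, g=g, tau=tau,
                alpha_x=alpha_x, beta_x=beta_x, alpha_z=alpha_z)

def rollout_dmp(p, dt, n_steps, g=None):
    """Integrate Eqs. (3.22)-(3.23) with explicit Euler steps; g overrides the goal."""
    g = p["g"] if g is None else g
    x, v, z = p["x0"], 0.0, 1.0                  # v stands for tau * xdot
    xs = np.empty(n_steps)
    for k in range(n_steps):
        xs[k] = x
        psi = np.exp(-p["h"] * (z - p["c"]) ** 2)
        f = (g - p["x0"]) * z * (psi @ p["w"]) / psi.sum()    # Eq. (3.25)
        v_dot = (p["alpha_x"] * (p["beta_x"] * (g - x) - v) + f) / p["tau"]
        x, v = x + dt * v / p["tau"], v + dt * v_dot
        z += dt * (-p["alpha_z"] * z / p["tau"])
    return xs

# Usage: learn from a minimum-jerk reach from 0 to 1, then replay and change the goal.
dt = 0.001
s = np.linspace(0.0, 1.0, 1001)
demo = 10 * s**3 - 15 * s**4 + 6 * s**5
params = learn_dmp(demo, dt)
replay = rollout_dmp(params, dt, len(demo))
new_goal = rollout_dmp(params, dt, len(demo), g=2.0)
```

Because the forcing term is scaled by $(g - x_0)$, passing a different `g` to `rollout_dmp` rescales the learned shape toward the new goal, which is the start and goal generalization discussed for DMPs.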

Variants of Dynamic Movement Primitives: Since DMPs were 
proposed, numerous variants have been developed. Hoffmann 
et al. [2009] proposed an extended version of DMPs for obstacle 
avoidance and real-time goal adaptation. Deniša et al. [2016] developed 
Compliant Movement Primitives (CMPs) for learning compliant 
motions that require physical interaction between a robot and objects. For 
learning coupled motions, several variants of DMPs have been proposed 
by Kober et al. [2008], Gams et al. [2014], and Amor et al. [2014]. Mülling 
et al. [2013] proposed the Mixture of Movement Primitives (MoMP), 
which generalizes movement primitives to new contexts by mixing a 
set of learned movement primitives. DMPs have been applied to various 
robotic tasks and are recognized as a standard representation of robotic 
motions. 

Relation to Hilbert Norm Minimization: Dragan et al. [2015] 
revealed the relation between DMP-like methods and trajectory 
optimization based on Hilbert norm minimization such as CHOMP [Zucker 
et al., 2013]. Dragan et al. [2015] formulated the problem of adapting 
a demonstrated trajectory $\boldsymbol{\tau}^{\text{demo}}$ to new start and goal states as 
minimization of the distance between the demonstration and the new 
trajectory subject to the new start and goal point constraints: 

$$\boldsymbol{\tau}^{*} = \arg\min_{\boldsymbol{\tau}} \left\| \boldsymbol{\tau}^{\text{demo}} - \boldsymbol{\tau} \right\|_{M}^{2} \tag{3.31}$$
$$\text{s.t.} \quad \boldsymbol{\tau}(0) = x_0^{\text{new}}, \tag{3.32}$$
$$\qquad \boldsymbol{\tau}(T) = x_T^{\text{new}}, \tag{3.33}$$

where $x_0^{\text{new}}$ and $x_T^{\text{new}}$ are the new start and goal states, and $M$ is a linear 
operator that defines the inner product in the Hilbert space. When time 
is discrete, $M$ is a matrix, and the squared norm is given by 
$\|\boldsymbol{\tau}\|_{M}^{2} = \boldsymbol{\tau}^{\top} M \boldsymbol{\tau}$. 
This formulation can be generalized to arbitrary norms, and Dragan 
et al. [2015] prove that trajectory adaptation with DMPs performs 
this norm minimization with a particular choice of the Hilbert norm, 
which is the same as the norm often used in trajectory optimization 
algorithms such as CHOMP [Zucker et al., 2013]. 
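
For a discretized one-dimensional trajectory, the constrained problem (3.31)-(3.33) has a closed-form solution obtained with Lagrange multipliers. The sketch below is only illustrative: the function name `adapt_trajectory` is hypothetical, and the matrix $M$ is assumed to be a first-difference (velocity) penalty with a small diagonal term added to make it invertible, which is one possible norm and not necessarily the one identified by Dragan et al. [2015].

```python
import numpy as np

def adapt_trajectory(traj_demo, new_start, new_goal, eps=1e-6):
    """Minimize ||traj - traj_demo||_M^2 subject to fixed endpoints, cf. (3.31)-(3.33)."""
    traj_demo = np.asarray(traj_demo, dtype=float)
    n = len(traj_demo)
    D = np.diff(np.eye(n), axis=0)               # (n-1, n) first-difference operator
    M = D.T @ D + eps * np.eye(n)                # assumed norm: velocity penalty, made invertible
    C = np.zeros((2, n))
    C[0, 0] = 1.0                                # selects the start point
    C[1, -1] = 1.0                               # selects the goal point
    d = np.array([new_start, new_goal], dtype=float)
    Minv_Ct = np.linalg.solve(M, C.T)            # M^{-1} C^T
    lam = np.linalg.solve(C @ Minv_Ct, d - C @ traj_demo)   # Lagrange multipliers
    return traj_demo + Minv_Ct @ lam

# Usage: move the endpoints of a toy demonstration.
t = np.linspace(0.0, 1.0, 50)
demo = np.sin(np.pi * t)
adapted = adapt_trajectory(demo, new_start=0.5, new_goal=2.0)
unchanged = adapt_trajectory(demo, demo[0], demo[-1])
```

With this choice of $M$, the correction added to the demonstration is spread smoothly along the trajectory, so the shape of the demonstration is preserved while the endpoints move.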


3.5.1.4 Probabilistic Movement Primitives 


While DMPs represent the movement in a deterministic way, 
demonstrations performed by human experts are often stochastic. Such 
probabilistic behavior cannot be represented by DMPs. Probabilistic 
Movement Primitives (ProMPs), proposed by Paraschos et al. [2013], 
represent movement as a distribution over trajectories. In a ProMP, the 
trajectory is parameterized as a linear combination of basis functions 
$\psi(t)$. The state of the system $\boldsymbol{x}(t)$ at time $t$ is expressed as 

$$\boldsymbol{x}(t) = \begin{bmatrix} q(t) \\ \dot{q}(t) \end{bmatrix} = \Psi(t)^{\top} \boldsymbol{w} + \boldsymbol{\epsilon}_x, \tag{3.34}$$

where $\Psi(t)$ is an $M \times 2$ dimensional time-dependent basis matrix defined as 

$$\Psi(t) = \begin{bmatrix} \psi_1(t) & \dot{\psi}_1(t) \\ \vdots & \vdots \\ \psi_M(t) & \dot{\psi}_M(t) \end{bmatrix}, \tag{3.35}$$

$\boldsymbol{w}$ is a weight vector, and $\boldsymbol{\epsilon}_x \sim \mathcal{N}(0, \Sigma_x)$ is zero-mean i.i.d. Gaussian 
noise. Here, the probability of observing the state $\boldsymbol{x}(t)$ is expressed as 

$$p(\boldsymbol{x}(t) \mid \boldsymbol{w}) = \mathcal{N}\left( \boldsymbol{x}(t) \mid \Psi(t)^{\top} \boldsymbol{w}, \Sigma_x \right). \tag{3.36}$$

Thus, the probability of observing the whole trajectory 
$\boldsymbol{\tau} = [\boldsymbol{x}(0), \ldots, \boldsymbol{x}(T)]$ is written as 

$$p(\boldsymbol{\tau} \mid \boldsymbol{w}) = \prod_{t} \mathcal{N}\left( \boldsymbol{x}(t) \mid \Psi(t)^{\top} \boldsymbol{w}, \Sigma_x \right). \tag{3.37}$$

By introducing a phase variable $z(t)$, we can achieve temporal 
modulation in ProMPs. The phase variable is defined as $z(0) = 0$ at the 
beginning of the movement and as $z(T) = 1$ at the end. The basis 
function directly depends on the phase variable by replacing $\psi(t)$ with 
$\psi(z(t))$ and $\dot{\psi}(t) = \psi'(z(t))\, \dot{z}(t)$. 

The basis function should be selected according to the type of 
the movement, as in DMPs. For point-to-point movements, one typical 
choice is a Gaussian function $b_i^{G}$, that is, 

$$b_i^{G}(z(t)) = \exp\left( -\frac{(z(t) - c_i)^2}{2h} \right), \tag{3.38}$$

where $h$ defines the width of the basis function and $c_i$ is the center 
of the $i$th basis function. For rhythmic movements, the von Mises 
function can be used to model periodicity. 

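For imitation learning, the weight vectors $\boldsymbol{w}$ and the covariance 
matrix $\Sigma_w$ need to be learned from the demonstrated trajectories. This 
problem can be formulated as a simple supervised learning problem. 
Let us assume that the trajectories demonstrated by experts are given 
as $\mathcal{D} = [\boldsymbol{\tau}^1, \ldots, \boldsymbol{\tau}^N]$. If we assume that the demonstrated trajectories 
are aligned properly in the time domain, a weight vector $\boldsymbol{w}^i$ for the 
$i$th demonstrated trajectory can be obtained by minimizing the sum of 
squared errors 

$$L_{\text{ProMP}} = \sum_{t=0}^{T} \left\| \boldsymbol{x}^i(t) - \Psi(t)^{\top} \boldsymbol{w}^i \right\|^2, \tag{3.39}$$

where $\boldsymbol{x}(t) = [q(t), \dot{q}(t)]^{\top}$. The solution is given by the least-squares 
solution 

$$\boldsymbol{w}^i = \left( \Psi \Psi^{\top} \right)^{-1} \Psi \boldsymbol{\tau}^i, \tag{3.40}$$

where $\boldsymbol{\tau}^i$ is stacked as a column vector and the basis function matrix 
$\Psi$ is given by 

$$\Psi = \begin{bmatrix} \psi_1(0) & \dot{\psi}_1(0) & \cdots & \psi_1(T) & \dot{\psi}_1(T) \\ \vdots & \vdots & & \vdots & \vdots \\ \psi_M(0) & \dot{\psi}_M(0) & \cdots & \psi_M(T) & \dot{\psi}_M(T) \end{bmatrix}. \tag{3.41}$$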

For each demonstrated trajectory, we obtain a weight vector, and for 
the whole set of demonstrated trajectories $\mathcal{D}$ we obtain a set of weight 
vectors $Q = [\boldsymbol{w}^1, \ldots, \boldsymbol{w}^N]$. From the set of weight vectors $Q$ we can 
estimate a distribution over the weight vectors $p(\boldsymbol{w}) \sim \mathcal{N}(\boldsymbol{\mu}_w, \Sigma_w)$. The 
distribution of the state at time $t$ can be modeled as 

$$p(\boldsymbol{x}(t)) = \mathcal{N}\left( \boldsymbol{x}(t) \mid \Psi(t)^{\top} \boldsymbol{\mu}_w, \; \Psi(t)^{\top} \Sigma_w \Psi(t) + \Sigma_x \right). \tag{3.42}$$

Algorithm 5 summarizes the procedure for learning ProMPs. 

Algorithm 5 Learning probabilistic movement primitives [Paraschos 
et al., 2013] 

Input: Multiple demonstrated trajectories $\mathcal{D} = \{\boldsymbol{\tau}_i^{\text{demo}}\}_{i=1}^{N}$ 
Choose a basis function $\psi$ and the number of basis functions $M$ 
Compute the basis function matrix $\Psi(t)$ 
for each demonstrated trajectory do 
    Obtain $\boldsymbol{w}$ by computing (3.40) 
end for 
Compute $p(\boldsymbol{w}) \sim \mathcal{N}(\boldsymbol{\mu}_w, \Sigma_w)$ 
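
The ProMP learning procedure can be sketched in a few lines. The code below is a simplified illustration, not the implementation of Paraschos et al. [2013]: it fits positions only (velocities are omitted), it normalizes the Gaussian basis functions, it adds a small ridge term to the least-squares solution for numerical stability, and it assumes that the phase runs linearly from 0 to 1 over each demonstration. The names `gaussian_basis` and `learn_promp` are hypothetical.

```python
import numpy as np

def gaussian_basis(z, n_basis=10, h=0.02):
    """Normalized Gaussian basis over phase z in [0, 1], cf. Eq. (3.38); shape (len(z), M)."""
    c = np.linspace(0.0, 1.0, n_basis)
    b = np.exp(-(z[:, None] - c) ** 2 / (2.0 * h))
    return b / b.sum(axis=1, keepdims=True)

def learn_promp(demos, n_basis=10, ridge=1e-6):
    """Fit one weight vector per demonstration, then a Gaussian over the weights."""
    weights = []
    for q in demos:                               # q: positions of one demonstration
        z = np.linspace(0.0, 1.0, len(q))         # linear phase, so lengths may differ
        Phi = gaussian_basis(z, n_basis)          # (T, M): transpose of Psi in the text
        A = Phi.T @ Phi + ridge * np.eye(n_basis)
        weights.append(np.linalg.solve(A, Phi.T @ q))   # regularized form of Eq. (3.40)
    weights = np.array(weights)
    return weights.mean(axis=0), np.cov(weights, rowvar=False)

# Usage: noisy-amplitude sine demonstrations of different lengths.
rng = np.random.default_rng(0)
amps = 1.0 + 0.1 * rng.standard_normal(20)
lengths = rng.integers(80, 120, size=20)
demos = [a * np.sin(np.pi * np.linspace(0.0, 1.0, int(T))) for a, T in zip(amps, lengths)]
mu_w, Sigma_w = learn_promp(demos)
z = np.linspace(0.0, 1.0, 100)
mean_traj = gaussian_basis(z) @ mu_w              # mean of Eq. (3.42), positions only
```

The returned mean and covariance play the roles of $\boldsymbol{\mu}_w$ and $\Sigma_w$; the variability between demonstrations (here, the amplitude) is captured in the covariance of the weights.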

One of the characteristic features of ProMPs is the conditional 
distribution of the weight conditioned on a sequence of states 
$\boldsymbol{x}^{*} = [\boldsymbol{x}(t), \ldots, \boldsymbol{x}(t')]$. When $\boldsymbol{x}^{*}$ is specified as via-points, the distribution 
of the weight vector conditioned on $\boldsymbol{x}^{*}$ is given as a Gaussian with 
mean and variance 

$$\boldsymbol{\mu}_w^{+} = \boldsymbol{\mu}_w + K \left( \boldsymbol{x}^{*} - H \boldsymbol{\mu}_w \right), \qquad \Sigma_w^{+} = \Sigma_w - K H \Sigma_w, \tag{3.43}$$

where $K = \Sigma_w H^{\top} \left( \Sigma_x + H \Sigma_w H^{\top} \right)^{-1}$ and $H$ is the observation 
matrix defined as $H = [\Psi(t), \ldots, \Psi(t')]^{\top}$. By using this conditioning, 
ProMPs can deal with modulation of via-points, final positions, or 
velocities. Figure 3.5 visualizes the conditioning of the trajectory 
distribution on the target position as an example. 
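
Conditioning a Gaussian weight distribution on a via-point is a standard Gaussian update. The following sketch applies it to a toy prior; the function name `condition_on_via_point`, the unnormalized basis, the prior, and the small observation covariance are assumptions made for illustration, and only a position (no velocity) is observed.

```python
import numpy as np

def condition_on_via_point(mu_w, Sigma_w, H, x_star, Sigma_x):
    """Gaussian conditioning of the weight distribution, cf. Eq. (3.43).

    H: (d, M) observation matrix, x_star: (d,) desired observation,
    Sigma_x: (d, d) covariance of the observation.
    """
    S = Sigma_x + H @ Sigma_w @ H.T
    K = Sigma_w @ H.T @ np.linalg.inv(S)          # gain matrix
    mu_new = mu_w + K @ (x_star - H @ mu_w)
    Sigma_new = Sigma_w - K @ H @ Sigma_w
    return mu_new, Sigma_new

# Usage: force the mean trajectory through position 1.0 at phase z = 0.5.
M = 5
centers = np.linspace(0.0, 1.0, M)

def basis(z):
    """Unnormalized Gaussian basis, Eq. (3.38) with h = 0.05 (assumed)."""
    return np.exp(-(z - centers) ** 2 / (2 * 0.05))

mu_w, Sigma_w = np.zeros(M), np.eye(M)            # toy prior over the weights
H = basis(0.5)[None, :]                           # observe the position at z = 0.5
mu_new, Sigma_new = condition_on_via_point(
    mu_w, Sigma_w, H, np.array([1.0]), 1e-6 * np.eye(1))
via_mean = (H @ mu_new)[0]                        # posterior mean at the via-point
via_var = (H @ Sigma_new @ H.T)[0, 0]             # posterior variance at the via-point
```

A small observation covariance pins the trajectory distribution tightly to the via-point, while a larger one expresses a softer preference.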


3.5.1.5 Trajectory Representation with Time-Invariant Dy- 
namical Systems 


Khansari-Zadeh and Billard [2011] developed a framework to 
represent task trajectories as a time-invariant dynamical system (DS) 
[Gribovskaya et al., 2011, Khansari-Zadeh and Billard, 2014]. While 
DMPs model the attractor dynamics and nonlinear behavior as separate 
terms, this framework models demonstrated movements as a single 
nonlinear dynamical system. The trajectory generated from this 
time-invariant DS is stably attracted to the given target position in the 
Lyapunov sense. By its nature, the time-invariant DS representation 
cannot represent time-variant behavior. 

Figure 3.5: Conditioning of the learned distribution on the target position 
[Paraschos et al., 2013]. 

Khansari-Zadeh and Billard [2011] and Gribovskaya et al. [2011] 
modeled demonstrated trajectories as an autonomous system [Khalil, 1996], 
which follows time-invariant dynamics as 

$$\dot{\boldsymbol{x}} = f(\boldsymbol{x}), \tag{3.44}$$

where $\boldsymbol{x}$ is the system state, and $f$ is a function that governs the 
behavior of the system. Khansari-Zadeh and Billard [2011] and 
Gribovskaya et al. [2011] learn the function $f$ as a GMM. 

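Let us define $\boldsymbol{x}$ as the state vector of the system. When a set of 
demonstrated trajectories is given, the joint distribution of $\boldsymbol{x}$ and $\dot{\boldsymbol{x}}$ can 
be estimated from the observations using a GMM. The $k$th component 
of the GMM models the distribution $p(\boldsymbol{x}, \dot{\boldsymbol{x}} \mid k)$ as 

$$\begin{bmatrix} \boldsymbol{x} \\ \dot{\boldsymbol{x}} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \boldsymbol{\mu}_{x,k} \\ \boldsymbol{\mu}_{\dot{x},k} \end{bmatrix}, \begin{bmatrix} \Sigma_{x,k} & \Sigma_{x\dot{x},k} \\ \Sigma_{\dot{x}x,k} & \Sigma_{\dot{x},k} \end{bmatrix} \right). \tag{3.45}$$

The estimated dynamics function $\hat{f}$ is learned as 

$$\hat{f}(\boldsymbol{x}) = \sum_{k=1}^{K} h_k(\boldsymbol{x}) \left( \boldsymbol{\mu}_{\dot{x},k} + \Sigma_{\dot{x}x,k} \Sigma_{x,k}^{-1} (\boldsymbol{x} - \boldsymbol{\mu}_{x,k}) \right), \tag{3.46}$$

where 

$$h_k(\boldsymbol{x}) = \frac{p(k)\, p(\boldsymbol{x} \mid k)}{\sum_{i=1}^{K} p(i)\, p(\boldsymbol{x} \mid i)} = \frac{\pi_k\, \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_{x,k}, \Sigma_{x,k})}{\sum_{i=1}^{K} \pi_i\, \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_{x,i}, \Sigma_{x,i})}, \tag{3.47}$$

and $\pi_k$ is the prior of the $k$th Gaussian component. 

The study by Khansari-Zadeh and Billard [2011] showed that the 
system described by (3.46) is globally asymptotically stable at the 
target $\boldsymbol{x}^{*}$ if the condition 

$$\begin{cases} -A^k \boldsymbol{x}^{*} = \boldsymbol{\mu}_{\dot{x},k} - A^k \boldsymbol{\mu}_{x,k}, \\ A^k + (A^k)^{\top} \text{ is negative definite}, \end{cases} \tag{3.48}$$

is satisfied for all $k = 1, \ldots, K$, where $A^k = \Sigma_{\dot{x}x,k} \Sigma_{x,k}^{-1}$. 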

Khansari-Zadeh and Billard [2011] proved that (3.48) is a sufficient 
condition for the system to be globally asymptotically stable 
in the sense of Lyapunov. For the details of the proof, we refer to the 
original paper [Khansari-Zadeh and Billard, 2011]. Khansari-Zadeh and 
Billard [2011] call this time-invariant DS, represented by GMMs under 
the constraints of (3.48), the stable estimator of dynamical systems (SEDS). 

This representation with time-invariant DS is nonparametric, and 
models the correlation of movements in multiple DoFs. In addition, 
this approach can also be used to learn second-order dynamics as 
$\ddot{\boldsymbol{x}} = g(\boldsymbol{x}, \dot{\boldsymbol{x}})$ (please refer to [Khansari-Zadeh and Billard, 2011] for 
more details). The approaches with DS have been applied to various 
applications, such as learning coupled movements and learning 
stiffness [Shukla and Billard, 2012, Lukic et al., 2014, Kim et al., 2014]. 

The limitation of this approach is that time-invariant 
representations, by their nature, cannot represent time-variant behaviors. In 
addition, due to the constraint of (3.48), SEDS can handle only 
models in which the dimensions of the input and output are equal [Shukla 
and Billard, 2012]. 
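
Once the GMM parameters are available, evaluating the dynamics and checking the stability conditions is straightforward. The sketch below assumes the parameters are already given (the constrained optimization that SEDS uses to fit them is not shown); the function names `gmr_velocity` and `satisfies_seds_conditions` and the toy two-component model are hypothetical.

```python
import numpy as np

def gmr_velocity(x, priors, means, covs):
    """Evaluate the velocity estimate of Eq. (3.46).

    means[k] = [mu_x; mu_xdot] and covs[k] is the joint covariance of (x, xdot).
    """
    d = len(x)
    h = np.empty(len(priors))
    vels = []
    for k, (pi_k, mu, S) in enumerate(zip(priors, means, covs)):
        mu_x, mu_v = mu[:d], mu[d:]
        S_x, S_vx = S[:d, :d], S[d:, :d]
        diff = x - mu_x
        # pi_k * N(x | mu_x, S_x): unnormalized responsibility, numerator of Eq. (3.47)
        h[k] = pi_k * np.exp(-0.5 * diff @ np.linalg.solve(S_x, diff)) / np.sqrt(
            (2 * np.pi) ** d * np.linalg.det(S_x))
        vels.append(mu_v + S_vx @ np.linalg.solve(S_x, diff))
    h /= h.sum()
    return h @ np.array(vels)

def satisfies_seds_conditions(means, covs, x_target, tol=1e-8):
    """Check the sufficient stability conditions of Eq. (3.48) for every component."""
    d = len(x_target)
    for mu, S in zip(means, covs):
        A = S[d:, :d] @ np.linalg.inv(S[:d, :d])
        b = mu[d:] - A @ mu[:d]
        if not np.allclose(b, -A @ x_target, atol=tol):
            return False
        if np.linalg.eigvalsh(A + A.T).max() >= 0:   # must be negative definite
            return False
    return True

# Usage: a toy 2-D model with two components that both pull toward the origin.
I2 = np.eye(2)
priors = np.array([0.5, 0.5])
means = [np.array([1.0, 1.0, -1.0, -1.0]), np.array([-1.0, 0.0, 2.0, 0.0])]
covs = [np.block([[I2, -I2], [-I2, 2 * I2]]),
        np.block([[I2, -2 * I2], [-2 * I2, 5 * I2]])]
x_target = np.zeros(2)
stable = satisfies_seds_conditions(means, covs, x_target)
x_query = np.array([2.0, 0.0])
v_query = gmr_velocity(x_query, priors, means, covs)
v_at_target = gmr_velocity(x_target, priors, means, covs)
```

In this toy model each component corresponds to a linear system with $A^k = -I$ and $A^k = -2I$, so the predicted velocity always points back toward the target and vanishes at the target itself.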


3.5.2 Comparison of Trajectory Representations 


We show a comparison of different trajectory representations in Ta- 
ble 3.4. As can be seen from Table 3.4, every representation has 
strengths and weaknesses. 
When choosing a trajectory representation, it is essential to consider 
the most parsimonious description for the desired trajectories 
and select a representation with a model complexity appropriate for 
the desired behavior. For example, SEDS in [Khansari-Zadeh and Bil- 
lard, 2011, 2014] represents the motion as a time-invariant dynamical 
system. Although SEDS may be insufficient to model time-dependent 
motions, SEDS works well for some tasks such as catching a flying ob- 
ject [Kim et al., 2014]. With regard to stable attraction to a target 
position, global asymptotic stability is guaranteed in the sense of Lya- 
punov for SEDS [Khansari-Zadeh and Billard, 2011]. This property is 
useful for planning a stable behavior to approach a target position. 

DMP is a good option for learning a point-to-point motion since 
motions can be easily generalized to different start and goal positions. 
In addition, bounded-input bounded-output (BIBO) stability is 
guaranteed with regard to stable attraction to a target position. For this 
reason, DMP is often used to represent primitive motions in task-level 
motion planning [Kroemer et al., 2015, Niekum et al., 2014, Manschitz 
et al., 2015]. On the other hand, stochasticity of the demonstrated 
trajectories cannot be encoded by DMPs, and multi-dimensional motion 
needs to be modeled by separate DMPs. ProMPs can address these 
problems. However, unlike DMP and SEDS, ProMPs do not guarantee 
stability of planned trajectories. 

Table 3.4: Comparison of trajectory representations. Time dependence means here 
that the learned policy differs for each time step. With regard to stable attraction 
to a target position, bounded-input bounded-output (BIBO) stability is guaranteed 
for DMPs [Ijspeert et al., 2013], and global asymptotic stability is guaranteed in 
the sense of Lyapunov for SEDS [Khansari-Zadeh and Billard, 2011]. Stochasticity 
of trajectories means that a method takes uncertainty into account when modeling 
behavior. Encoding spatial coordination means here that a method can explicitly 
model the coordination of multi-dimensional motions. 

| Representation | Time dependence | Stable attraction to a target position | Stochasticity of trajectories | Encoding spatial coordination patterns |
|---|---|---|---|---|
| Way points / Keyframe [Abbeel et al., 2010, Nakaoka et al., 2007] | ✓ | - | - | - |
| HMMs [Inamura et al., 2004, Takano and Nakamura, 2015] | (✓) | - | ✓ | ✓ |
| DMP [Schaal et al., 2004, Ijspeert et al., 2013] | ✓ | ✓ | - | - |
| ProMP [Paraschos et al., 2013, Maeda et al., 2016] | ✓ | - | ✓ | ✓ |
| SEDS [Khansari-Zadeh and Billard, 2011, 2014] | - | ✓ | - | ✓ |

In this section we presented several different trajectory 
representations and gave some suggestions on how to choose among them 
based on their different properties. However, how to choose 
among the trajectory representations is still an interesting open 
question. Although efforts to benchmark these different techniques have 
been made, e.g., [Lemme et al., 2015], it remains necessary to establish 
metrics and benchmarks for comparing existing methods. 


3.5.3 Generalization of Demonstrated Trajectories 


Generalization of the demonstrated trajectories is one of the most im- 
portant problems in imitation learning. The parameterization of trajec- 
tories enables generalizing the movements to new scenes. For example, 
a movement represented as a DMP can be adapted to a new scene 
by changing parameters such as goal and start positions. A popular 
approach for generalizing a parametrized motion is conditioning Gaus- 
sian distributions. This approach appears in several frameworks such 
as ProMP and SEDS. However, generalization with conditioning on 
Gaussian distributions is limited to situations where feature vectors 
with fixed length are available. Therefore, these methods often require 
manually selected feature vectors which are sufficiently informative. 
Another way to generalize demonstrated skills is to leverage 
geometrical warping from a demonstrated scene to a new scene. Recent work 
such as [Schulman et al., 2013, Lee et al., 2015a,b, Huang et al., 2015] 
proposes methods for generalizing skills to new scenes based on non-rigid 
registration of point clouds, which does not rely on feature vectors of 
fixed length. In the following, we give a short overview of the 
generalization of demonstrated behaviors using different representations. 

Motion Generalization with DMP: A trajectory represented with 
DMPs can be generalized to different start and goal positions [Schaal 
et al., 2004, Ijspeert et al., 2013]. For generalization according to 
additional features, some extensions are required. For example, Amor et al. 
[2014] proposed to model the joint distribution of DMP parameters and 
generalize learned motion in human-robot interaction scenarios. 

Motion Generalization with ProMP: ProMP learns the distribution 
of the demonstrated trajectory in a parameter space. By conditioning 
the learned distribution, we can generalize the demonstrated 
trajectories to new start and goal positions or via-points [Paraschos 
et al., 2013, Maeda et al., 2016]. Maeda et al. [2016] show how to adapt 
learned ProMP skills in the context of human-robot interaction. 

Motion Generalization with SEDS: Since the SEDS approach 
learns the joint distribution of the state and motion of the system, 
the demonstrated motion can be generalized to new states 
[Khansari-Zadeh and Billard, 2011]. 

Trajectory Transfer with Geometrical Warping: Although conditioning 
on Gaussian distributions is a popular method for generalizing 
skills, it is limited to generalization with feature vectors 
of fixed length. Another way to generalize demonstrated skills 
is to leverage geometrical warping of the demonstrated scene to a new 
scene. Schulman et al. [2013] proposed a method to generalize 
the demonstrated trajectories based on non-rigid registration. In 
the non-rigid registration problem, one tries to find a correspondence 
between two point-sets and determine a good non-rigid transformation 
that can map one point-set onto the other [Chui and Rangarajan, 
2003]. Thus far, non-rigid registration has been applied to, for example, 
template matching in OCR, motion generation in animation, and image 
registration in medical image analysis. Schulman et al. [2013] used 
non-rigid registration in order to transfer the demonstrated trajectories to 
new contexts as shown in Figure 3.6. 

The trajectory transfer method consists of three steps: 1) find a 
transformation from the training scene to the test scene using a 
non-rigid registration method, 2) apply the transformation to the 
demonstrated end-effector trajectory in task space, and 3) convert the 
end-effector trajectory in task space into a joint-space trajectory. 

Figure 3.6: Trajectory transfer using non-rigid registration [Schulman et al., 2013]. 
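
The first two steps of trajectory transfer can be illustrated with a plain thin plate spline (TPS) fit in two dimensions. This sketch makes strong simplifying assumptions: the correspondences between scene points are taken as known (TPS-RPM also estimates them), the scenes are two-dimensional, and the third step (inverse kinematics) is omitted. The names `fit_tps` and `warp` are hypothetical, and the test scene is a rotated and shifted copy of the training scene so that the result is easy to check.

```python
import numpy as np

def _tps_kernel(r):
    """2-D thin plate spline kernel U(r) = r^2 log r, with U(0) = 0."""
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(r > 0, r ** 2 * np.log(r), 0.0)

def fit_tps(src, dst, reg=0.0):
    """Fit a 2-D TPS f with f(src[i]) ~ dst[i]; correspondences are assumed known."""
    n = len(src)
    K = _tps_kernel(np.linalg.norm(src[:, None] - src[None], axis=-1)) + reg * np.eye(n)
    P = np.hstack([np.ones((n, 1)), src])             # affine part: [1, x, y]
    L = np.block([[K, P], [P.T, np.zeros((3, 3))]])
    rhs = np.vstack([dst, np.zeros((3, 2))])
    params = np.linalg.solve(L, rhs)
    return src, params[:n], params[n:]                # control points, W, affine A

def warp(tps, pts):
    """Apply a fitted TPS to an array of 2-D points."""
    src, W, A = tps
    U = _tps_kernel(np.linalg.norm(pts[:, None] - src[None], axis=-1))
    return U @ W + np.hstack([np.ones((len(pts), 1)), pts]) @ A

# Step 1: register the training scene to the test scene.
src = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5], [0.2, 0.8]], dtype=float)
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
shift = np.array([0.3, -0.1])
dst = src @ R.T + shift
tps = fit_tps(src, dst)

# Step 2: warp the demonstrated end-effector path into the test scene.
demo_traj = np.column_stack([np.linspace(0, 1, 20), np.linspace(0, 1, 20) ** 2])
new_traj = warp(tps, demo_traj)
```

Setting `reg` to a positive value trades exact interpolation of the scene points for a smoother warp, which is useful when the point correspondences are noisy.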

This method has been extended in various ways [Lee et al., 2015a,b, 
Huang et al., 2015]. Trajectory transfer with non-rigid registration can 
be used to generalize both spatial motion and force profiles [Lee et al., 
2015a]. Although the original work on trajectory transfer with non- 
rigid registration employed the thin plate spline robust point matching 
(TPS-RPM) approach proposed in [Chui and Rangarajan, 2003], the 
framework is not limited to specific non-rigid registration methods. The 
recent work by Lee et al. [2015b] shows that the use of the coherent 
point drift (CPD) algorithm improves trajectory transfer performance. 

Unlike methods such as ProMPs or the dynamical systems ap- 
proach, non-rigid registration based trajectory transfer works directly 
on point clouds and can generalize demonstrated trajectories to new 
scenes without modeling the distribution over demonstrated trajecto- 
ries. However, non-rigid trajectory transfer requires that system dy- 
namics are approximately invariant between source and target scenar- 
ios [Schulman et al., 2013]. In order to plan a trajectory in a new scene, 
one must select demonstrations performed in scenes with covariant sys- 
tem dynamics. For thousands of stored demonstrations, this search for 
an appropriate demonstration is a time-consuming process. 

We discussed methods for generalizing demonstrated trajectories to new 
situations. Table 3.5 shows a comparison of these methods. 
DMPs allow stable convergence to arbitrary goal positions, 
but DMPs' generalization capability is relatively limited compared to 
other methods. ProMPs can generalize trajectories by Gaussian 
conditioning, but there is no guarantee of stable behavior. SEDS can 
generalize the trajectories with a guarantee of stable behavior, but cannot 
model the time dependence of movements. Trajectory transfer using 
non-rigid registration can achieve complex generalization, but does not 
incorporate stochasticity in demonstrations and there is no guarantee 
of stable behavior. 

In addition to the methods discussed above, there are numerous studies 
on generalizing demonstrated trajectories. Calinon [2015] proposed the 
task-parameterized Gaussian mixture model (TP-GMM), which encodes 
the context information in its trajectory model. The approach 
based on TP-GMM has been recently employed in several studies 
[Calinon, 2016, Rozo et al., 2016]. The recent work by Osa et al. [2017a] 
proposed a trajectory optimization method for collision avoidance, which 
incorporates the distribution of the demonstrated trajectories. In 
addition, although we focused on the trajectory-based approach, recent 
work such as [Finn et al., 2017b, Nair et al., 2017, Liu et al., 2017, 
Rahmatizadeh et al., 2017] addressed the problem of generalizing skills 
based on visual information by using a deep learning approach, which 
is a promising way to deal with complex environments. 

Table 3.5: Generalization of skills using existing methods. DMPs enable stable 
convergence to arbitrary goal positions. ProMPs can generalize trajectories by Gaussian 
conditioning, but there is no guarantee of stable behavior. SEDS can generalize 
trajectories while guaranteeing stable behavior, but cannot model time dependence 
of movements. Trajectory transfer using non-rigid registration can achieve complex 
generalization, but does not incorporate stochasticity of demonstrations and there 
is no guarantee of stable behavior. 

| Method | Generalizable context | Advantages | Disadvantages |
|---|---|---|---|
| DMP [Schaal et al., 2004, Ijspeert et al., 2013] | Start and goal positions | Stable convergence to a goal position | Limited generalization capabilities |
| ProMP [Paraschos et al., 2013, Maeda et al., 2016] | Any subset of the observations of the system | Generalization based on stochasticity of demonstrations | No guarantee of stable behavior |
| SEDS [Khansari-Zadeh and Billard, 2011, 2014] | State of the system with [...] dimensionality | Generalization with guarantee of stable behavior | No time-dependence |
| Trajectory transfer using non-rigid registration | Way points with the given scene [...] | Generalization based on point clouds of a [...] | Stochasticity of [...] |


3.5.4 Information Theoretic Understanding of Model-Free BC 


Trajectory representations such as DMP, ProMP, and SEDS 
parameterize the trajectories as $p(\boldsymbol{\tau} \mid \boldsymbol{w})$ by solving linear equations using a 
least-squares method. Solving linear equations by minimizing a 
sum-of-squares error function is equivalent to maximizing the likelihood for the 
given dataset of demonstrations $\mathcal{D} = \{\boldsymbol{\tau}_i^{\text{demo}}\}_{i=1}^{N}$ under the assumption 
that the noise is drawn from a Gaussian distribution. This solution can 
be interpreted from an information theoretic point of view. 
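
The equivalence between least squares and maximum likelihood can be made explicit with a short derivation. As an illustration, we use the ProMP likelihood (3.37) and assume, for simplicity, isotropic noise $\Sigma_x = \sigma^2 I$ with $d$ denoting the dimension of $\boldsymbol{x}(t)$; both the isotropy and the symbol $d$ are assumptions introduced here.

```latex
% Assumption: isotropic noise, \Sigma_x = \sigma^2 I, and d = \dim \boldsymbol{x}(t).
\begin{align*}
\ln p(\boldsymbol{\tau} \mid \boldsymbol{w})
  &= \sum_{t=0}^{T} \ln \mathcal{N}\!\left( \boldsymbol{x}(t) \mid \Psi(t)^{\top} \boldsymbol{w},\, \sigma^2 I \right) \\
  &= -\frac{1}{2\sigma^2} \sum_{t=0}^{T} \left\| \boldsymbol{x}(t) - \Psi(t)^{\top} \boldsymbol{w} \right\|^2
     - \frac{(T+1)\, d}{2} \ln\!\left( 2\pi\sigma^2 \right), \\
\arg\max_{\boldsymbol{w}} \ln p(\boldsymbol{\tau} \mid \boldsymbol{w})
  &= \arg\min_{\boldsymbol{w}} \sum_{t=0}^{T} \left\| \boldsymbol{x}(t) - \Psi(t)^{\top} \boldsymbol{w} \right\|^2 .
\end{align*}
```

The second term does not depend on $\boldsymbol{w}$, so the maximizer of the log-likelihood coincides with the minimizer of the sum-of-squares error in (3.39).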

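According to information theory, the entropy is a quantity that 
represents the amount of information, and the KL divergence can be 
obtained as a Bregman divergence derived from the entropy [Amari, 
2016]. As described in [Bishop, 2006], finding parameters that maximize 
the likelihood $p(\boldsymbol{\tau} \mid \boldsymbol{w})$ for the given dataset is equivalent to minimizing 
the KL divergence given by 

$$D_{\mathrm{KL}}\left( q(\boldsymbol{\tau}) \,\|\, p(\boldsymbol{\tau} \mid \boldsymbol{w}) \right) = \int q(\boldsymbol{\tau}) \ln \frac{q(\boldsymbol{\tau})}{p(\boldsymbol{\tau} \mid \boldsymbol{w})} \, d\boldsymbol{\tau},$$

where $q(\boldsymbol{\tau})$ is the distribution of the trajectory induced by the experts' 
policy. A sample of the demonstrated trajectories $\boldsymbol{\tau}^{\text{demo}}$ is drawn from 
the distribution $q(\boldsymbol{\tau})$ induced by the experts' policy. Therefore, the 
expectation with respect to $q(\boldsymbol{\tau})$ can be approximated as 

$$D_{\mathrm{KL}}\left( q(\boldsymbol{\tau}) \,\|\, p(\boldsymbol{\tau} \mid \boldsymbol{w}) \right) \approx \frac{1}{N} \sum_{i=1}^{N} \left( -\ln p(\boldsymbol{\tau}_i^{\text{demo}} \mid \boldsymbol{w}) + \ln q(\boldsymbol{\tau}_i^{\text{demo}}) \right). \tag{3.49}$$

Since $\ln q(\boldsymbol{\tau})$ is independent of $\boldsymbol{w}$, minimizing $D_{\mathrm{KL}}(q(\boldsymbol{\tau}) \,\|\, p(\boldsymbol{\tau} \mid \boldsymbol{w}))$ 
is equivalent to maximizing the log-likelihood $\ln p(\boldsymbol{\tau} \mid \boldsymbol{w})$ for the given 
dataset $\mathcal{D}$. 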
Therefore, a policy obtained by model-free BC methods based on 
maximizing the likelihood under the Gaussian noise assumption can be 
regarded as the policy that minimizes the KL divergence as 

$$\boldsymbol{w}^{*} = \arg\min_{\boldsymbol{w}} D_{\mathrm{KL}}\left( q(\boldsymbol{\tau}) \,\|\, p(\boldsymbol{\tau} \mid \boldsymbol{w}) \right).$$

Thus, we can see that the model-free methods discussed in the previous 
section parameterize the demonstrated behaviors by minimizing the 
KL divergence, each in a different parameter space, as shown in Figure 3.7. 

Figure 3.7: Schematic illustration of model-free BC methods. Model-free BC 
methods can often be interpreted as an M-projection onto the policy model manifold. 

It is important to note that these model-free methods can suffer 
from the problem of covariate shift where the distribution of the test 
condition is different from the distribution of the demonstrated con- 
ditions. In other words, the learned skill may not work when the test 
condition is too different from the demonstrated condition. To cope 
with this problem, we will need incremental learning methods, which 
are discussed in § 3.5.7. 


3.5.5 Time Alignment of Multiple Demonstrations 


When the expert demonstrates the task trajectory multiple times, the 
execution speed usually differs between demonstrations. Therefore, when 
a task trajectory is learned from multiple demonstrations, the time 
alignments of the demonstrated trajectories often need to be normalized 
if a time-dependent trajectory representation is used. 

For this purpose, dynamic time warping (DTW) proposed by Sakoe 
and Chiba [1978] is often employed. Although DTW was originally 
developed for speech recognition, it is frequently used to deal with the 
time alignment of trajectories in robotics. The original formulation of 
DTW finds the best time alignment of two data sequences. However, 
we often obtain more than two demonstrations, and we need to align 
all of them appropriately in the time domain. 

Algorithm 6 Estimate the latent trajectory and the time alignments 
of multiple demonstrations [van den Berg et al., 2010] 

Initialize: $R^j = I$, and $z_t^j = t \, T^j / T_{\text{ave}}$ 
repeat 
    $\boldsymbol{\xi} \leftarrow \text{KalmanSmoother}(\boldsymbol{y}, R, z)$ 
    $R \leftarrow \arg\max_{R} \mathbb{E}_{\xi} \left( l(R \mid \boldsymbol{\xi}, \boldsymbol{y}) \right)$ 
    $z \leftarrow \arg\max_{z} \mathbb{E}_{\xi} \left( l(z \mid \boldsymbol{\xi}, \boldsymbol{y}) \right)$ 
until convergence 
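
For reference, the classic two-sequence DTW recursion can be sketched as follows. This is the textbook dynamic program for one-dimensional sequences with an absolute-difference cost, written in plain Python; the function name `dtw` is our own, and aligning several demonstrations by warping each one to a common reference is only a simple stand-in for the EM-based procedure.

```python
def dtw(a, b):
    """Dynamic time warping between two 1-D sequences.

    Returns the accumulated cost and the warping path as (index in a, index in b) pairs.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]   # D[i][j]: cost of aligning a[:i], b[:j]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(D[i - 1][j - 1], i - 1, j - 1),
                 (D[i - 1][j], i - 1, j),
                 (D[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return D[n][m], path

# Usage: the same motion executed at two different speeds.
fast = [0, 1, 2, 3]
slow = [0, 0, 1, 2, 2, 3]
cost, path = dtw(fast, slow)
cost_mismatch, _ = dtw([0, 1], [0, 2])
```

The returned path gives, for every sample of one sequence, the matching sample of the other, which can be used to resample a demonstration onto the time axis of a reference.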

In the field of imitation learning, Coates et al. [2008] proposed 
a method to normalize the time alignment of multiple demonstrated 
trajectories. Similar approaches appear in applications such as au- 
tonomous helicopter flight [Abbeel et al., 2010] and automation of 
robotic surgery [van den Berg et al., 2010, Osa et al., 2014]. Here, 
we review the method employed by van den Berg et al. [2010]. 

van den Berg et al. [2010] regarded the demonstrated trajectories 
as noisy "observations" of the "reference" trajectory. The reference 
trajectory and the time mapping from the reference trajectory to 
the demonstrated trajectory are computed using the 
expectation-maximization (EM) algorithm. 

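The linear system is described as 

$$\boldsymbol{\xi}(t+1) = \begin{bmatrix} A & B \\ 0 & I \end{bmatrix} \boldsymbol{\xi}(t) + \boldsymbol{w}(t), \qquad \boldsymbol{w}(t) \sim \mathcal{N}\left( 0, \begin{bmatrix} P & 0 \\ 0 & Q \end{bmatrix} \right), \tag{3.50}$$

where $\boldsymbol{\xi}(t) = [\boldsymbol{x}^{\top}(t), \boldsymbol{u}^{\top}(t)]^{\top}$ is the state and the control input of the 
system at time $t$, and $A$ and $B$ are the state matrix and the input matrix, 
respectively. $\boldsymbol{w}(t)$ is the noise that follows the zero-mean Gaussian 
distribution. $P$ and $Q$ are the covariance matrices of process noise and 
observation noise, respectively. If we assume that the $j$th demonstrated 
trajectory $\boldsymbol{\tau}^j$ is given by 
$\boldsymbol{\tau}^j = [\boldsymbol{x}^j(0), \boldsymbol{u}^j(0), \ldots, \boldsymbol{x}^j(T^j), \boldsymbol{u}^j(T^j)]$, the 
relation between the reference trajectory and the observed trajectories 
is represented as 

$$\begin{bmatrix} \boldsymbol{\tau}^1(z_t^1) \\ \vdots \\ \boldsymbol{\tau}^N(z_t^N) \end{bmatrix} = \begin{bmatrix} I \\ \vdots \\ I \end{bmatrix} \boldsymbol{\xi}(t) + \boldsymbol{v}(t), \qquad \boldsymbol{v}(t) \sim \mathcal{N}\left( 0, \begin{bmatrix} R^1 & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & R^N \end{bmatrix} \right), \tag{3.51}$$

where $\boldsymbol{v}$ is the noise that follows a zero-mean Gaussian distribution, 
and $z_t^j$ is the mapping of time $t$ in the reference trajectory $\boldsymbol{\xi}$ to the 
corresponding time in trajectory $\boldsymbol{\tau}^j$. The covariance matrices $R^j$ behave 
as weights on the $j$th demonstrated trajectory $\boldsymbol{\tau}^j$ for estimating the 
reference trajectory $\boldsymbol{\xi}$. 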

The reference trajectory $\hat{z}$, the covariance matrices $R^j$ and the
time mapping $\tau$ are estimated using the EM algorithm. In the E-step,
the reference trajectory $\hat{z}$ can be estimated using a Kalman
smoother based on the model in (3.50). In the M-step, the time mapping
$\tau$ and the covariance matrices $R^j$ are updated by maximizing the
likelihood with respect to the estimated $\hat{z}$. DTW is used to update
the time mapping $\tau$ in [Abbeel et al., 2010, van den Berg et al.,
2010]. This procedure is summarized in Algorithm 6.
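The DTW step used for the time mapping can be illustrated with a minimal
sketch. It assumes scalar samples and an absolute-difference local cost,
both chosen here only for illustration; the function name `dtw` is
hypothetical, and the sketch aligns a single pair of sequences rather
than reproducing the full EM procedure of van den Berg et al. [2010].

```python
def dtw(reference, demo):
    """Align two scalar sequences; return (total_cost, warping_path).

    The path is a list of (reference_index, demo_index) pairs.
    """
    n, m = len(reference), len(demo)
    inf = float("inf")
    # cost[i][j]: minimal accumulated cost of aligning the first i
    # reference samples with the first j demonstration samples.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(reference[i - 1] - demo[j - 1])  # assumed local cost
            cost[i][j] = local + min(cost[i - 1][j],      # advance reference
                                     cost[i][j - 1],      # advance demo
                                     cost[i - 1][j - 1])  # advance both
    # Backtrack from the end to recover the time mapping.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [(cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1)]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path


reference = [0.0, 1.0, 2.0, 3.0]
demo = [0.0, 0.0, 1.0, 2.0, 2.0, 3.0]  # same motion, executed more slowly
total_cost, path = dtw(reference, demo)
```

The returned path plays the role of the time mapping: each pair states
which demonstration sample corresponds to which reference sample.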


3.5.6 Learning Coupled Movements 


It is often necessary to learn the correlation of movements between 
multiple DoFs or multiple agents. For example, in human-robot inter- 
action, an autonomous agent needs to know how to react to a human 
operator’s movements. In such a case, the human movement and the 
robot reaction can be considered as coupled movements. In this section, 
we review how to learn such correlations of movements with multiple 
DoFs or agents. One typical approach is to model the joint distribution
of the parameterized trajectories in multiple DoFs with a Gaussian (or a
mixture of Gaussians). When partial observations of the coupled
movements are given, the remaining movements are estimated by
conditioning the distribution on the partial observation. We will see in
the following section that the choice of the trajectory representation
plays an important role in modeling the trajectory distribution.

3.5.6.1 Learning Coupled Movements with DMPs


DMPs have been used to learn both perceptual coupling and coupling 
for human-robot collaborative motion [Kober et al., 2008, Amor et al., 
2014]. In robotic applications, a movement is often represented as tra- 
jectories in multiple spaces. For example, a position of an end effector 
can be measured using a vision system in Cartesian space, while a 
trajectory of a robotic manipulator is often controlled in joint space. 
When DMPs are used, trajectories in different spaces are often learned
as separate DMPs. However, it is essential to learn the coupling
between the trajectories in different spaces. Kober et al. [2008]
proposed to learn such perceptual coupling for motor skills with DMPs.
Instead of using the forcing function shown in (3.25), the perceptual
coupling is modeled using the modified forcing function


$$f = \frac{\sum_{i=1}^{M} \psi_i w_i z}{\sum_{i=1}^{M} \psi_i} + \frac{\sum_{j=1}^{M_e} \hat{\psi}_j \left( \kappa_j^\top (y - \bar{y}) + \delta_j^\top (\dot{y} - \dot{\bar{y}}) \right)}{\sum_{j=1}^{M_e} \hat{\psi}_j}, \tag{3.52}$$

where $y$ denotes the state of the external variable, $\bar{y}$ is the
expected state of the external variable, and $\kappa$ and $\delta$ are
the coupling factors that act as gains on the difference between the
desired and actual behaviors of the external variable. $M_e$ is the
number of basis functions for modeling the coupled behavior. While the
weight vectors can be learned from a single demonstration, the coupling
factors $\kappa$ and $\delta$ cannot be learned from demonstrations,
since deviations from the nominal behavior are necessary for learning
these parameters. For this reason, Kober et al. [2008] used a
reinforcement learning method for learning $\kappa$ and $\delta$ through
trial and error.


3.5.6.2 Learning Coupled Movements with Gaussian Conditioning


Statistical machine learning methods offer ways to model correlations
between variables. For example, Gaussian conditioning is a simple way to
model such correlations. Coupled motion in robotic applications can be
learned using such statistical methods. Amor et al. [2014] represented
the motions of two agents using DMPs and learned the correlations of
the distribution of the motion parameters. When one agent's motion
is observed, the motion of the other agent can be predicted based on
Gaussian conditioning.

Likewise, ProMPs have also been used to learn the correlation of 
multiple agents’ motion. Maeda et al. [2016] developed an imitation 
learning framework called Interaction ProMP to learn coupled motions 
in human-robot collaboration. In the framework of Interaction ProMP, 
correlated movements are learned as a distribution of the correlated 
weight vectors of ProMPs. Using a partial observation of the movement, 
unobserved movements are estimated as a conditional distribution of 
the weight vectors on the given partial observation. 

Here, we describe the details of Interaction ProMP. Suppose that
demonstrations of human-robot collaborative movements are given. We
define the state vector as a concatenation of the $P$ DoFs executed by
the human, followed by the $Q$ DoFs executed by the robot,

$$x(t) = \begin{bmatrix} x^H(t) \\ x^R(t) \end{bmatrix}, \tag{3.53}$$

where $x^H(t)$ is a $P \times 1$ dimensional vector that represents the
state of the human, and $x^R(t)$ is a $Q \times 1$ dimensional vector
that represents the state of the robot at time $t$. The distribution of
the trajectory is parameterized as

$$p(x \mid \bar{w}) = \mathcal{N}\left(x \mid H^\top(t)\bar{w}, \Sigma_y\right), \tag{3.54}$$

where

$$H^\top(t) = \mathrm{diag}\left(\Psi^\top(t), \ldots, \Psi^\top(t)\right), \tag{3.55}$$

$\Psi^\top(t)$ is an $M \times 2$ matrix defined as in (3.35), and $M$ is
the number of basis functions. When a trajectory of a human-robot
collaborative movement is demonstrated, the weight vector $\bar{w}$ can
be learned as

$$\bar{w} = \left[(w_1^H)^\top, \ldots, (w_P^H)^\top, (w_1^R)^\top, \ldots, (w_Q^R)^\top\right]^\top. \tag{3.56}$$

By learning from multiple demonstrations, we can obtain the distribution
of the weight vector $p(\bar{w}) \sim \mathcal{N}(\mu_{\bar{w}}, \Sigma_{\bar{w}})$,
where $\mu_{\bar{w}} \in \mathbb{R}^{(P+Q)M \times 1}$ and
$\Sigma_{\bar{w}} \in \mathbb{R}^{(P+Q)M \times (P+Q)M}$. After learning
the distribution of the weight vector $p(\bar{w})$, the robot's reaction
to an observed human movement can be planned as the conditional
distribution of the weight vectors. When a sequence of observations of
the human movement $y^*$ is given, the conditional distribution of the
ProMP parameters given the observation, $p(\bar{w} \mid y^*)$, can be
computed by applying Bayes' theorem (3.43). By using a mixture of
Interaction ProMPs, a non-Gaussian distribution $p(\bar{w})$ can be
represented as a mixture of Gaussians [Ewerton et al., 2015, Maeda et
al., 2016]. The framework of Interaction ProMPs is summarized in
Figure 3.8.

Figure 3.8: Overview of Interaction ProMPs in [Maeda et al., 2016]. In
the Interaction ProMP framework, correlated movements are learned as the
joint distribution of weight vectors of ProMPs. Thanks to the
probabilistic modeling of the trajectory distribution, the Interaction
ProMP framework works with noisy observations of trajectories [Maeda et
al., 2016]. In this figure, $\bar{w}$ represents the weight vector that
contains movements of all DoFs controlled by the robot and the human
operator, as defined in (3.56).

In the Interaction ProMP framework, correlated movements are
learned as correlated weight vectors of ProMPs. Thanks to the
probabilistic modeling of the trajectory distribution, the Interaction
ProMP framework works with noisy observations of trajectories [Maeda et
al., 2016].
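The conditioning step can be sketched for a toy case. The code below
assumes a linear-Gaussian observation model $y^* = A w + \epsilon$, where
the hypothetical matrix `A` selects the observed human part of the
weight vector; in Interaction ProMP this role would be played by the
basis-function matrix of the observed DoFs. The function name and the
numbers are illustrative only.

```python
import numpy as np


def condition_weights(mu_w, Sigma_w, A, y_obs, Sigma_y):
    """Posterior mean and covariance of w given y_obs = A w + noise.

    The noise is assumed to be zero-mean Gaussian with covariance Sigma_y.
    """
    S = A @ Sigma_w @ A.T + Sigma_y            # covariance of the observation
    K = np.linalg.solve(S, A @ Sigma_w).T      # gain: Sigma_w A^T S^{-1}
    mu_post = mu_w + K @ (y_obs - A @ mu_w)
    Sigma_post = Sigma_w - K @ A @ Sigma_w
    return mu_post, Sigma_post


# Toy prior: one human weight and one robot weight, strongly correlated.
mu_w = np.array([0.0, 0.0])
Sigma_w = np.array([[1.0, 0.9],
                    [0.9, 1.0]])
A = np.array([[1.0, 0.0]])      # only the human component is observed
Sigma_y = np.array([[1e-6]])    # nearly noise-free observation
mu_post, Sigma_post = condition_weights(mu_w, Sigma_w, A,
                                        np.array([2.0]), Sigma_y)
```

In this toy case the robot weight shifts toward 1.8 and its variance
shrinks to about 0.19, which is the behavior expected from a correlation
of 0.9 between the two weights.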


3.5.6.3 Learning Coupled Movements with Time-Invariant 
Dynamical Systems 


The time-invariant dynamical system (DS) approach in [Khansari-Zadeh
and Billard, 2011] can also be used to learn coupled movements [Shukla
and Billard, 2012, Lukic et al., 2014, Kim et al., 2014]. Shukla and
Billard [2012] developed a framework for learning coupled movements
based on DS, which they call the Coupled Dynamical System (CDS) model.
The idea of CDS is to model the correlation between two agents using
statistical models.

Let us assume that two agents, which we call the master and the slave,
perform a coupled motion. The correlation between the movement of the
master $x_m$ and the movement of the slave $x_s$ can be modeled with
CDS. In CDS, three GMMs are trained to model three joint distributions:

1) the joint distribution of the master movement $p(x_m, \dot{x}_m)$,

2) the joint distribution of the state of the master and the desired
state of the slave $p(\Psi(x_m), x_s^d)$,

3) the joint distribution of the slave movement $p(\tilde{x}_s, \dot{x}_s)$,

where $\tilde{x}_s = x_s - x_s^d$ and $x_s^d$ is the desired state of the
slave. To ensure the stability of the system, SEDS is used to model
these three joint distributions [Khansari-Zadeh and Billard, 2011]. The
function $\Psi(\cdot)$ maps $x_m$ to the same dimensionality as $x_s$.
This mapping is necessary because SEDS can handle only models in which
the inputs and outputs have the same dimensionality [Shukla and
Billard, 2012].

The reproduction of learned motions is performed by repeating three
steps. First, the movement of the master is planned using
$p(x_m, \dot{x}_m)$. Second, the desired state of the slave is estimated
based on $p(x_s^d \mid \Psi(x_m))$. Third, the motion of the slave is
planned based on $p(\tilde{x}_s, \dot{x}_s)$. These steps are repeated
until the system converges to the goal position. The CDS approach has
been applied to learn the correlation between the arm and fingers
[Shukla and Billard, 2012, Kim et al., 2014], or the eye and arm [Lukic
et al., 2014].


3.5.7 Incremental Trajectory Learning 


Demonstrations by human experts are not always optimal for the 
learner, and the performance of the learner can be unsatisfactory after 
learning from demonstrations. In such cases, corrective actions can be 
used to improve the performance of the learner. 

The study by Calinon and Billard [2007] extended the framework
of statistical trajectory learning in [Calinon et al., 2007] to
incremental learning. In [Calinon and Billard, 2007], GMMs are
initialized with trajectories demonstrated by a human wearing a motion
sensor. Subsequently, the motion of the humanoid robot is modified
through kinesthetic teaching by a human coach. Through this iterative
process, the model of the trajectory distribution is improved
incrementally. The method in [Calinon and Billard, 2007] is summarized
in Algorithm 7. The method in [Lee and Ott, 2011] used a similar
representation by combining GMMs with HMMs. In the framework of [Lee
and Ott, 2011], the compliance of a robot manipulator is controlled in
order to represent an area where motion refinement is allowed. However,
the method in [Calinon and Billard, 2007] does not address the context
of the task. Therefore, the generalization of the demonstrated
trajectories to new situations is not considered. Recent follow-up work
[Havoutis and Calinon, 2017] addressed online learning and the
adaptation of the skill to new contexts by combining an optimal control
approach and the TP-GMM in [Calinon, 2015].

Algorithm 7 Incremental gesture learning [Calinon and Billard, 2007]

repeat
    Record the demonstrated trajectories
    Project demonstrated trajectories onto the latent space with PCA
    Recognize the motion
    Train GMMs
    Plan a trajectory in the latent space using the updated GMMs
    Re-project the planned trajectory onto the joint space
    Execute/simulate the trajectory
until task learned

Ewerton et al. [2016] used ProMPs for incremental imitation with
generalization to different contexts. They parameterize trajectories
with ProMPs as $p(\tau \mid w)$. To generalize the demonstrated
trajectories to new contexts, the joint Gaussian distribution $p(w, s)$
of the trajectory parameters and the context is incrementally learned
under the supervision of a human. Given a new context $s^{\text{new}}$,
the trajectory is planned from the conditional distribution
$p(\tau \mid s^{\text{new}})$. The method in [Ewerton et al., 2016],
which is suitable for incremental learning of time-dependent
trajectories, is summarized in Algorithm 8. Recently, an incremental
learning method that combines DMPs and Gaussian Processes (GPs) was
proposed by Maeda et al. [2017]. By modeling the conditional trajectory
distribution with GPs, the system can generalize the trajectories to new
scenes and request additional demonstrations when the prediction
uncertainty is large. In addition, the convergence to the desired point
can be ensured by DMPs.

Algorithm 8 Incremental imitation learning of context-dependent motor
skills [Ewerton et al., 2016]

Input: demonstrated trajectories and the contexts $\mathcal{D} = \{\tau, s\}$
Initialize $p(w, s)$ with $\mathcal{D}$
for each new context $s$ do
    Compute $\mu_{w|s}$ and $\Sigma_{w|s}$
    Compute $\mu_{\tau|s}$ and $\Sigma_{\tau|s}$
    repeat
        Plan the trajectory based on $p(\tau \mid s)$
        Execute the trajectory with human intervention
        Record the context and the executed trajectory $\tau_{\text{new}}, s_{\text{new}}$
    until human decides to stop
    Compute the weight vector $w_{\text{new}}$ for $\tau_{\text{new}}$
    Update $p(w, s)$ using $w_{\text{new}}$ and $s_{\text{new}}$
end for

Kronander et al. [2015] proposed incremental trajectory learning
using a local modulation of a time-invariant dynamical system. The
concept of local modulation is applicable to various vector fields. We
describe some details of the framework in the following. Let $M(x)$ be
the local modulation function. The velocity at the state $x$ is given by

$$\dot{x}_{\mathrm{mod}} = M(x)\,\dot{x}_{\mathrm{ini}}, \tag{3.57}$$

where $\dot{x}_{\mathrm{mod}}$ is the velocity with the local modulation
and $\dot{x}_{\mathrm{ini}}$ is the velocity given by the initial
dynamical system. In the framework of [Kronander et al., 2015], the
local modulation is represented by a scaling and a rotation of the
original dynamics. Therefore, the modulation function is given by

$$M(x) = (1 + \kappa(x))\,R(x), \tag{3.58}$$

where $\kappa$ is a scaling factor and $R$ is a rotation matrix. For 2D
motion, $R$ is parameterized by a rotation angle $\phi$. For 3D motion,
$R$ is parameterized by a rotation angle $\phi$ and the rotation vector
$\mu_R$. When additional local demonstrations are given, the nonlinear
local dynamics are modeled with a GP.

While a GP was used to model the local modulation, the framework in
[Kronander et al., 2015] is not limited to a specific regression
method. For movements that can be represented as a vector field, the
method in [Kronander et al., 2015] is a reasonable option for
incremental learning.
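The 2D case of the modulation in (3.57) and (3.58) can be sketched as
follows. In the framework the scaling and the rotation angle are
state-dependent and predicted by a regressor such as a GP; here they are
simply passed in as numbers, and the function name is hypothetical.

```python
import numpy as np


def modulate_velocity_2d(x_dot_ini, kappa, phi):
    """Apply M = (1 + kappa) R(phi) to a 2D velocity vector."""
    rotation = np.array([[np.cos(phi), -np.sin(phi)],
                         [np.sin(phi),  np.cos(phi)]])
    return (1.0 + kappa) * rotation @ x_dot_ini


# Speed up the original velocity by 50% and rotate it by 90 degrees.
x_dot_ini = np.array([1.0, 0.0])
x_dot_mod = modulate_velocity_2d(x_dot_ini, kappa=0.5, phi=np.pi / 2)
```

With zero scaling and zero angle the modulation is the identity, so the
original dynamics are unchanged wherever no corrective demonstration has
been given.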


3.5.8 Combining Multiple Expert Policies 


When multiple movement primitives can be learned, it is possible to
combine them to generalize to new situations. Jacobs et al. [1991]
proposed the concept of mixture of experts, which generates a policy by
mixing multiple experts' policies. Given multiple experts' policies
$\{\pi_i\}_{i=1}^{M}$, the policy can be obtained as a mixture of these
policies

$$\pi(x) = \frac{\sum_{i=1}^{M} o_i \pi_i(x)}{\sum_{j=1}^{M} o_j}, \tag{3.59}$$

where $o_i$ is the weight on each expert policy.

Another way of combining multiple experts' policies is products of
experts, proposed by Hinton [2002]. The policy can be obtained as a
product of multiple experts' policies

$$\pi(x) = \frac{\prod_{i=1}^{M} \pi_i(x)}{\int \prod_{j=1}^{M} \pi_j(x)\,\mathrm{d}x}. \tag{3.60}$$
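The difference between the two combination rules can be made concrete
with a small sketch. It assumes scalar expert outputs for the mixture
and one-dimensional Gaussian experts for the product, for which the
normalized product is again Gaussian with a precision-weighted mean.
Both function names are hypothetical.

```python
def mixture_of_experts(outputs, weights):
    """Weighted average of expert outputs, normalized by the total weight."""
    total = sum(weights)
    return sum(o * y for o, y in zip(weights, outputs)) / total


def product_of_gaussian_experts(means, variances):
    """Mean and variance of the normalized product of 1D Gaussian experts."""
    precision = sum(1.0 / v for v in variances)
    mean = sum(m / v for m, v in zip(means, variances)) / precision
    return mean, 1.0 / precision


mixed = mixture_of_experts(outputs=[1.0, 3.0], weights=[1.0, 3.0])
prod_mean, prod_var = product_of_gaussian_experts(means=[0.0, 2.0],
                                                  variances=[1.0, 3.0])
```

The mixture averages the experts, whereas the product concentrates
probability mass where all experts agree, so its variance is smaller
than that of any single expert.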

In the imitation learning literature, the concept of mixture of experts
has been applied to multiple DMPs [Mülling et al., 2013]. Mülling et al.
[2013] learned a library of DMP-based movement primitives for hitting
a table tennis ball. In [Mülling et al., 2013], given an incoming ball,
a mixture of the learned policies generates a striking movement. In
addition to initializing policies by learning from demonstration,
Mülling et al. [2013] used a reinforcement learning method to improve
the performance.

Likewise, Ewerton et al. [2015] learned human-robot collaborative 
motions as a mixture of ProMPs. Ewerton et al. [2015] learned vari- 
ous interaction patterns as Gaussian Mixture models of ProMP weight 
vectors. This method can also be interpreted as a variant of mixture of 
experts. 

Haruno et al. [2001] proposed the modular selection and identifi- 
cation for control (MOSAIC) model, which learns multiple modules of 
forward and inverse dynamics models. In the MOSAIC model, each 
module learns local models, and the control input is determined by a 
mixture of multiple modules. 

Although the concept of products of experts has been used in 
reinforcement learning, it has not been popular in imitation learning 
so far. An interesting direction of future work could be using products 
of experts for combining multiple expert policies in imitation learning. 


3.6 Model-Free Behavioral Cloning for Task-Level Planning


When a task requires a complex motion, it is often necessary to plan 
the motion as a sequence of primitive motions. This kind of high level 
motion planning is known as task-level planning [Lozano-Perez et al., 
1989, Ekvall and Kragic, 2008, Cambon et al., 2009, Lagriffoul et al., 
2014]. In this section, we review model-free behavioral cloning methods 
for task-level planning. 


3.6.1 Segmentation and Clustering for Task-Level Planning 


Although model-free methods for trajectory learning often implicitly 
assume that each demonstrated trajectory contains a single motion, a 
demonstrated trajectory may consist of a sequence of different types of 
primitive motions in practice. Therefore, in order to learn each prim- 
itive motion, it is necessary to segment the demonstrated trajectory. 
In addition, after the segmentation of trajectories, it is often
necessary to cluster the segmented motions in order to learn multiple
types of primitive motions. However, manual segmentation and clustering
of trajectories is often time-consuming. For this reason, methods for
segmenting and clustering the demonstrated trajectories have been
investigated in the field of imitation learning. The development of
methods for trajectory segmentation is closely related to the
theoretical advances in clustering in machine learning. Although
theories for segmentation and clustering are beyond our scope, we
briefly review methods for segmentation and clustering in imitation
learning.

Kohlmorgen and Lemm [2001] developed an online segmentation
method based on HMMs. By computing the "distance" between nearby
data windows, Kohlmorgen and Lemm [2001] segment human motion data
using unsupervised learning. Kulić et al. [2008] proposed a
method for segmenting and clustering whole-body motions by using
factorized HMMs. In their method, the distances between HMMs are
computed, and segments of the observed motion are clustered into a
tree structure. Fearnhead and Liu [2007] proposed an online direct
simulation algorithm for online inference in change-point problems
(problems where the probability distribution changes at
"change-points"). Konidaris et al. [2011] extended the approach in
[Fearnhead and Liu, 2007] to learning skill trees. The beta process
autoregressive HMM (BP-AR-HMM) developed by Fox et al. [2009] is a
Bayesian nonparametric approach that finds dynamic features in
time-series data. The BP-AR-HMM is also employed by Niekum et al.
[2014] for learning primitive motion sequences in robotics. As seen
from these previous studies, advances in trajectory segmentation in
imitation learning [Kulić et al., 2008, Konidaris et al., 2011, Niekum
et al., 2014] are closely related to the methodological advances
[Fearnhead and Liu, 2007, Fox et al., 2009] in the machine learning
community.


3.6.2 Learning a Sequence of Primitive Motions 


For learning a sequence of primitive motions, it is necessary to model 
the structure of the skill and learn the transition between primitive 
motions from the demonstrated behavior. 

Figure 3.9: Learning a motion sequence in [Manschitz et al., 2015]. A
library of movement primitives is learned from demonstrations, and
transitions between movement primitives are modeled using SVMs. Panels:
MP Library, Goal Learning, Graph Learning, Switching Behavior.


One way of learning a sequence of movement primitives is to learn 
a tree-like structure of skills. Konidaris et al. [2011] proposed an on- 
line algorithm for constructing skill trees from demonstrations. Based 
on change point detection using MAP estimation [Fearnhead and Liu, 
2007], a demonstrated trajectory is segmented into a skill chain. Multi- 
ple skill chains are merged into a skill tree by identifying similar skills 
in different skill chains. The method in [Konidaris et al., 2011] has been 
applied to path planning of a mobile robot. 

Another way to sequence movements is to learn a transition model
between different movement primitives. Manschitz et al. [2015] learn
a library of movement primitives and use a support vector machine
(SVM) to solve the multi-class classification problem of choosing the
next movement primitive for each current movement primitive. This
results in a movement primitive graph structure, as shown in
Figure 3.9.

For learning a probabilistic transition model between movement
primitives, HMM-based methods are often used. In the state-based
transitions autoregressive hidden Markov model (STARHMM) [Kroemer et
al., 2014], the probability distribution over latent variables also
depends on the observed state, in contrast to the classical
autoregressive hidden Markov model (AR-HMM), where the current latent
state depends only on the previous latent state. STARHMM includes a
latent phase variable that defines the current phase of the task. The
framework in [Kroemer et al., 2015] uses STARHMM to represent a task as
a sequence of DMPs [Ijspeert et al., 2013], where the phase variable
corresponds to the currently active DMP. The model allows for a
conditional movement primitive plan that switches from one DMP to
another based on the observations. Kroemer et al. [2015] learn DMPs
using imitation learning and optimize high-level policies using
reinforcement learning. They demonstrate the approach in robotic
manipulation tasks, as shown in Figure 3.10.

Figure 3.10: Learning a hierarchical skill in [Kroemer et al., 2015].
Left: A sequence of skills is modeled using a variant of an HMM. Right:
The learned DMPs can be adapted to different objects.

Algorithm 9 Incremental semantically grounded learning from
demonstration [Niekum et al., 2014]

Input: Demonstrated trajectories and object poses $\mathcal{D} = \{\tau^{\text{demo}}, o\}$
Segment the demonstrations with BP-AR-HMM
for each segment do
    Learn parameters of DMP
end for
Construct FSA
Replay the task based on the current observation
if correction is necessary then
    Collect interactive correction from users
end if

Although it is often assumed that a sufficient amount of demonstration
data is available, this may not be the case in many applications.
Incremental imitation learning for task-level planning, proposed by
Niekum et al. [2014], can address this issue. The framework in [Niekum
et al., 2014] leverages unstructured demonstrations and corrective
actions by human experts. Niekum et al. [2014] segment the demonstrated
task using a Beta Process Autoregressive Hidden Markov Model
(BP-AR-HMM) [Fox et al., 2009], and model the transition between
discrete primitives as a finite-state automaton (FSA). When a new
situation is given, the learner uses the trained FSA to plan the task as
a sequence of movement primitives. If an expert considers that
refinement of the planned motion is necessary, the expert can stop the
autonomous execution of the task and correct the motion through
kinesthetic teaching. In this way, the learner improves its performance
through interaction with experts. Algorithm 9 summarizes the procedure.

Figure 3.11: Mutual language model between motion and sequence in
[Takano and Nakamura, 2015] (figure used with permission of Wataru
Takano). Relevance between words and motion is learned using a
probabilistic model. The approach can work in two directions: generating
sentences from motion or generating motion from sentences. When motion
is observed, a motion language semantic graph model generates words for
the observed motion. A natural language model then arranges the words
into sentences. When language is observed, it is segmented into words
using a natural language model, and the words are then transformed into
motion using a semantic graph.

One interesting approach for task-level planning is to leverage
annotation of demonstrated motions. Recently, Takano and Nakamura
[2015] developed methods for learning a mutual model between language
and motions, which leverage a dataset of demonstrated motions and
annotated sentences. In the framework of [Takano and Nakamura, 2015],
the relationship between the motion symbols and words via latent
variables is learned as a motion language model, and the sentence
structure is learned as a natural language model using an n-gram model.
Figure 3.11 summarizes the framework of a mutual model between language
and motion. HMMs are used to represent primitive motions, and the
library of primitive motions is learned as a set of HMMs. In the motion
language model, the probabilities $p(\lambda \mid y)$ and
$p(y \mid \lambda)$ are learned, where $y$ is an annotated sentence and
$\lambda$ is the motion symbol. This motion language model can be
learned using an EM algorithm. Meanwhile, a natural language model
learns the transition between two words $p(y_i \mid y_j)$. When a new
motion $\tau^{\text{new}}$ is observed, the corresponding motion symbol
$\lambda^{\text{new}}$ is predicted using HMMs. Subsequently, words
associated with the motion symbol are estimated as

$$y^* = \arg\max_{y} p(y \mid \lambda^{\text{new}}), \tag{3.61}$$

where $\lambda^{\text{new}}$ is the recognized motion symbol. Thereafter,
the estimated words are arranged grammatically using the natural
language model. When a new sentence $y^{\text{new}}$ is given, the motion
symbol that is most probable given $y^{\text{new}}$ is selected,

$$\lambda^* = \arg\max_{\lambda \in \Lambda} p(\lambda \mid y^{\text{new}}), \tag{3.62}$$

where $\Lambda$ is a set of learned motion symbols, and $\lambda^*$ is
the predicted motion symbol. A motion sequence is then generated using
the predicted motion symbol. The method is summarized in Algorithm 10.
Leveraging the mutual model between language and motion will be an
interesting research direction in imitation learning.

Algorithm 10 Motion language model [Takano and Nakamura, 2015]

Learning:
    Input: demonstrated trajectories and sentences $\mathcal{D} = \{\tau_i^{\text{demo}}, y_i\}$
    Train a set of HMMs that represent the primitive motions
    Train the motion language model and the natural language model
Prediction:
    Input: a motion sequence or a sentence
    if the given input is a motion sequence then
        Recognize the motion symbol $\lambda^{\text{new}}$ using HMMs
        Predict words for the given motion
            $y^* = \arg\max_{y} p(y \mid \lambda^{\text{new}})$
        Arrange the order of the words using the natural language model
        return sentence
    end if
    if the given input is a sentence then
        Predict a motion symbol corresponding to the given sentence $y^{\text{new}}$
            $\lambda^* = \arg\max_{\lambda \in \Lambda} p(\lambda \mid y^{\text{new}})$
        Predict the motion sequence from the motion symbol $\lambda^*$
        return motion sequence
    end if
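The two prediction directions can be illustrated with a toy sketch. The
probability tables below are invented for illustration, and scoring a
sentence by a naive-Bayes product of word probabilities is an assumption
made here as a stand-in for the learned $p(\lambda \mid y)$; it is not
the latent-variable model of Takano and Nakamura [2015].

```python
import math

# Hypothetical learned table p(word | motion symbol).
P_WORD_GIVEN_SYMBOL = {
    "walk":  {"person": 0.40, "walks": 0.50, "throws": 0.10},
    "throw": {"person": 0.40, "walks": 0.05, "throws": 0.55},
}
# Hypothetical prior over motion symbols.
P_SYMBOL = {"walk": 0.5, "throw": 0.5}


def predict_words(symbol, k=2):
    """Motion to language: the k most probable words for a motion symbol."""
    table = P_WORD_GIVEN_SYMBOL[symbol]
    return sorted(table, key=table.get, reverse=True)[:k]


def predict_symbol(words):
    """Language to motion: the symbol with the highest posterior score."""
    def log_score(symbol):
        table = P_WORD_GIVEN_SYMBOL[symbol]
        return math.log(P_SYMBOL[symbol]) + sum(math.log(table[w]) for w in words)
    return max(P_SYMBOL, key=log_score)


words = predict_words("throw")
symbol = predict_symbol(["person", "walks"])
```

In the full method the predicted words would then be ordered by the
n-gram natural language model, and the selected symbol would index the
HMM used to generate the motion.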


3.7 Model-Based Behavioral Cloning Methods


We discuss model-based behavioral cloning (BC) in this section. As 
we discussed in §2.3, model-based BC methods require an iterative 
learning process with access to a forward dynamics model. Next, we 
discuss model-based BC in more detail. 


3.7.1 Model-Based Behavioral Cloning Methods with 
Forward Dynamics Models 


In imitation learning, experts demonstrate behavior and an au- 
tonomous agent tries to imitate the demonstrations. However, the em- 
bodiment of the expert is often different from the embodiment of the 
learner. In such cases, the demonstrated trajectory needs to be ad- 
justed for the embodiment of the learner. Otherwise, the learner fails 
to perform the intended task properly. This problem is known as the 
“correspondence problem” in imitation learning [Billard et al., 2008]. 
The correspondence problem frequently appears when we try to teach 
humanoids how to imitate human motion obtained e.g. from motion 
trackers [Ude et al., 2004, Nakaoka et al., 2007]. Due to the different 
embodiments between a human expert and a robot learner, it is es- 
sential to adapt the demonstrated trajectories to follow the constraints 
and dynamics of the learner. 

Even when the embodiments of the demonstrator and learner
match, we may face a similar correspondence problem when we try
to execute a trajectory at a velocity differing from the original
velocity [van den Berg et al., 2010, Englert et al., 2013]. Even if the
desired configuration is kinematically feasible, the demonstrated or
desired velocity may be infeasible due to the underactuation of the
manipulator. In this case, it is also necessary to adjust the planned
trajectory based on the system dynamics.

The straightforward way to solve the correspondence problem is
to explicitly learn a forward dynamics model of the system

$$x_{t+1} = f(x_t, u_t) \tag{3.63}$$

and then plan trajectories based on the learned forward model. Forward
dynamics model learning can be framed as a regression problem.
Table 3.6 lists different regression methods that have been utilized in
model-based BC. Although locally weighted regression and Gaussian
mixture regression were used in early studies of model-based methods,
recent studies often employ Gaussian Processes. As we will review in
§3.7.1.2, Gaussian Processes can incorporate inputs with uncertainty.
This property is important for multi-step forward prediction, since the
uncertainty is propagated over time. However, due to its computational
cost, Gaussian Process regression is not suitable for high-dimensional
data. To deal with high-dimensional data such as raw images, deep
learning approaches are employed for modeling the forward dynamics in
the most recent studies [Oh et al., 2015, Finn et al., 2017a, Baram et
al., 2017, Nair et al., 2017]. In the following sections, we review some
of the model-based methods with explicit learning of a forward model.
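To make the regression view concrete, the following minimal sketch fits
a forward model to transition data by ordinary least squares. The linear
model class $x_{t+1} \approx A x_t + B u_t$ is an assumption chosen for
brevity and stands in for the regressors listed in Table 3.6; the
function name and the synthetic data are hypothetical.

```python
import numpy as np


def fit_linear_forward_model(X, U, X_next):
    """Least-squares fit of x_{t+1} ~= A x_t + B u_t.

    X: (T, dx) states, U: (T, du) actions, X_next: (T, dx) next states.
    """
    Z = np.hstack([X, U])                           # regression inputs z_t
    W, *_ = np.linalg.lstsq(Z, X_next, rcond=None)  # shape (dx + du, dx)
    dx = X.shape[1]
    return W[:dx].T, W[dx:].T                       # A is (dx, dx), B is (dx, du)


# Synthetic, noise-free transitions from a known system.
rng = np.random.default_rng(0)
A_true = np.array([[1.0, 0.1],
                   [0.0, 1.0]])
B_true = np.array([[0.0],
                   [0.1]])
X = rng.normal(size=(50, 2))
U = rng.normal(size=(50, 1))
X_next = X @ A_true.T + U @ B_true.T
A_hat, B_hat = fit_linear_forward_model(X, U, X_next)
```

Once such a model is available, a trajectory can be rolled out by
applying it recursively, which is where the propagation of prediction
uncertainty over time becomes relevant.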


Table 3.6: Model-based behavioral cloning methods using different
regression methods. Early studies on model-based behavioral cloning
focused on locally weighted regression, but later studies have moved to
Gaussian mixture regression and, even more recently, to Gaussian
processes. We expect that studies based on deep neural networks will be
popular in the near future.

Regression method              Employed by
Locally Weighted Regression    [Atkeson et al., 1997, Schneider, 1997]
Gaussian Mixture Regression    [Grimes et al., 2006b, Grimes and Rao, 2009]
Gaussian Process Regression    [Grimes et al., 2006a, Englert et al., 2013,
                               Deisenroth et al., 2014]
Neural Networks                [Baram et al., 2017, Nair et al., 2017]


3.7.1.1 Imitation with a Gaussian Mixture Forward Model


We will now discuss details of the methods in [Grimes et al., 2006b, 
Grimes and Rao, 2009] as an example of learning forward dynamics 
with Gaussian Mixture Models (GMMs). 

We can obtain a dataset of state x; and action ur trajectories D = 
{ri = [ri ui- æi, ui] from sensor readings. If we introduce 
zt = (xt, utl, the joint distribution of a4, and z; can be modeled as a 
mixture of Gaussian distributions as 


P(@141, 21) = >| P) (Hp, Ze), (3.64) 
k 


where p(k) is the prior and the kth Gaussian component is given by 


By k Xor k 
. (3.65 
| ? | Dey k Yak }) ( ) 


The conditional distribution of 2,41 for a given z¥ is a Gaussian dis- 


Zt Mak 


Lt+1 Mak 


plti, zilk) =M (| 


tribution with the mean and variance given by 
Hrjz = 5 WkHgrjz,k? 
K 
k=1 


where 


Hzjz,k = Mak + Donen) (27 = Mek) 
Delz,k = Dak a izk (Saa Drk (3.67) 
DPN (2 f | Mees Bek) 

Mri P(K)N (z{| Hz ks Disk) 
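When a given input is drawn from a Gaussian distribution
$\mathbf{z}_t^* \sim \mathcal{N}(\boldsymbol{\mu}^*, \boldsymbol{\Sigma}^*)$,
the marginal distribution $p(\mathbf{x}_{t+1} \mid \boldsymbol{\mu}^*, \boldsymbol{\Sigma}^*)$
is approximated by a Gaussian distribution with the mean $\boldsymbol{\mu}_{t+1}$
and covariance $\boldsymbol{\Sigma}_{t+1}$ given by

$$
\boldsymbol{\mu}_{t+1} = \boldsymbol{\mu}_{x|z} = \sum_{k=1}^{K} w_k \boldsymbol{\mu}_{x|z,k}, \qquad
\boldsymbol{\Sigma}_{t+1} = \boldsymbol{\Sigma}_{x|z} = \sum_{k=1}^{K} w_k \left( \boldsymbol{\Sigma}_{x|z,k}
  + \boldsymbol{\mu}_{x|z,k} \boldsymbol{\mu}_{x|z,k}^{\top} \right)
  - \boldsymbol{\mu}_{x|z} \boldsymbol{\mu}_{x|z}^{\top},
\tag{3.68}
$$

Algorithm 11 Behavior acquisition via Bayesian inference and learning
[Grimes and Rao, 2009]

    Observe an expert's demonstrations $[\mathbf{o}_1, \ldots, \mathbf{o}_T]$
    Estimate the kinematics of the expert
    Initialize the forward model $f$
    Infer bootstrap actions based on the forward model
    repeat
        Execute actions
        Learn/update the GMR forward model
        Infer constrained actions
    until task learned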


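where

$$
\begin{aligned}
\boldsymbol{\mu}_{x|z,k} &= \boldsymbol{\mu}_{x,k}
  + \boldsymbol{\Sigma}_{xz,k} \left( \boldsymbol{\Sigma}_{z,k} + \boldsymbol{\Sigma}^* \right)^{-1}
    (\boldsymbol{\mu}^* - \boldsymbol{\mu}_{z,k}), \\
\boldsymbol{\Sigma}_{x|z,k} &= \boldsymbol{\Sigma}_{x,k}
  - \boldsymbol{\Sigma}_{xz,k} \left( \boldsymbol{\Sigma}_{z,k} + \boldsymbol{\Sigma}^* \right)^{-1}
    \boldsymbol{\Sigma}_{zx,k}, \\
w_k &= \frac{p(k)\, \mathcal{N}(\boldsymbol{\mu}^* \mid \boldsymbol{\mu}_{z,k}, \boldsymbol{\Sigma}_{z,k} + \boldsymbol{\Sigma}^*)}
            {\sum_{k'=1}^{K} p(k')\, \mathcal{N}(\boldsymbol{\mu}^* \mid \boldsymbol{\mu}_{z,k'}, \boldsymbol{\Sigma}_{z,k'} + \boldsymbol{\Sigma}^*)}.
\end{aligned}
\tag{3.69}
$$

Grimes and Rao [2009] used this GMR for one-step prediction and
recursively predicted the learner's trajectories. Using the learned forward
model, the actions are selected so as to maximize the posterior likelihood
as

$$
\mathbf{u}_1^*, \ldots, \mathbf{u}_T^* = \arg\max_{\mathbf{u}_1, \ldots, \mathbf{u}_T}
p(\mathbf{u}_1, \ldots, \mathbf{u}_T \mid \mathbf{o}_1, \ldots, \mathbf{o}_T, \mathbf{c}_1, \ldots, \mathbf{c}_T),
\tag{3.70}
$$

where $[\mathbf{o}_1, \ldots, \mathbf{o}_T]$ is a time series of the observed
demonstrated states, and $[\mathbf{c}_1, \ldots, \mathbf{c}_T]$ is a time series
of the feasible states of the learner under the kinematic and dynamic
constraints. By repeating the execution of the planned trajectories, the
estimate of the forward model improves. Algorithm 11 summarizes the
procedure in [Grimes and Rao, 2009].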


3.7.1.2 Imitation with a Gaussian Process Forward Model 


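Recent studies on model-based BC have employed Gaussian Processes
(GPs) for modeling the forward dynamics of the system, $f \sim \mathcal{GP}$
[Englert et al., 2013, Deisenroth et al., 2014]. Given a dataset
$\mathcal{D} = \{\mathbf{x}_{t+1}, \mathbf{z}_t\}$ where
$\mathbf{z}_t = [\mathbf{x}_t^{\top}, \mathbf{u}_t^{\top}]^{\top}$, a GP models
a mapping from the input $\mathbf{z}_t$ to the output
$\mathbf{x}_{t+1} = f(\mathbf{z}_t)$ as

$$
f(\mathbf{z}_t) \sim \mathcal{GP}\left( m(\mathbf{z}_t), k(\mathbf{z}_t, \mathbf{z}_t') \right),
\tag{3.71}
$$

where $m(\mathbf{z})$ is the mean function and $k(\mathbf{z}, \mathbf{z}')$ is
the covariance function. A popular choice of a covariance function is the
squared exponential covariance function given by

$$
k(\mathbf{z}, \mathbf{z}') = \exp\left( -\frac{\|\mathbf{z} - \mathbf{z}'\|^2}{2} \right).
\tag{3.72}
$$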


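The joint distribution of the training targets $\mathbf{X}_{t+1}$ and the
function value $\mathbf{x}_{t+1}^*$ at the test input $\mathbf{z}_t^*$ can be
written as

$$
\begin{bmatrix} \mathbf{X}_{t+1} \\ \mathbf{x}_{t+1}^* \end{bmatrix}
\sim \mathcal{N}\left( \mathbf{0},
\begin{bmatrix}
K(\mathbf{Z}, \mathbf{Z}) + \sigma^2 \mathbf{I} & K(\mathbf{Z}, \mathbf{z}_t^*) \\
K(\mathbf{z}_t^*, \mathbf{Z}) & K(\mathbf{z}_t^*, \mathbf{z}_t^*)
\end{bmatrix} \right),
\tag{3.73}
$$

where $\mathbf{Z}$ is a matrix in which the input vectors $\mathbf{z}_t$ for all
training samples are aggregated and $\sigma^2$ is the variance of the
observation noise. The conditional distribution of $\mathbf{x}_{t+1}^*$ given
the test input $\mathbf{z}_t^*$ is a Gaussian with mean and variance

$$
\begin{aligned}
\boldsymbol{\mu}(\mathbf{z}_t^*) &= \mathbf{k}_*^{\top} \mathbf{K}^{-1} \mathbf{X}_{t+1}, \\
\boldsymbol{\Sigma}(\mathbf{z}_t^*) &= k_{**} - \mathbf{k}_*^{\top} \mathbf{K}^{-1} \mathbf{k}_*,
\end{aligned}
\tag{3.74}
$$

where $\mathbf{K} = K(\mathbf{Z}, \mathbf{Z}) + \sigma^2 \mathbf{I}$,
$\mathbf{k}_* = K(\mathbf{Z}, \mathbf{z}_t^*)$, and
$k_{**} = K(\mathbf{z}_t^*, \mathbf{z}_t^*)$.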
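
As a rough illustration of (3.72) and (3.74), the following NumPy sketch
computes the GP posterior mean and variance for a single output dimension
with the squared exponential kernel and a zero prior mean. The function
names `se_kernel` and `gp_predict` and the default noise variance are
assumptions made for this example; practical implementations usually also
learn kernel hyperparameters, which is omitted here.

```python
import numpy as np


def se_kernel(A, B):
    """Squared exponential kernel of Eq. (3.72) between rows of A and rows of B."""
    sq_dist = (np.sum(A**2, axis=1)[:, None]
               + np.sum(B**2, axis=1)[None, :]
               - 2.0 * A @ B.T)
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0))


def gp_predict(Z, X_next, z_star, noise_var=1e-2):
    """Posterior mean and variance of Eq. (3.74) for one output dimension.

    Z: (N, d) training inputs, X_next: (N,) training targets,
    z_star: (d,) test input. noise_var is an assumed noise variance.
    """
    K = se_kernel(Z, Z) + noise_var * np.eye(Z.shape[0])    # K(Z, Z) + sigma^2 I
    k_star = se_kernel(Z, z_star[None, :])[:, 0]            # K(Z, z*)
    k_ss = se_kernel(z_star[None, :], z_star[None, :])[0, 0]
    mean = k_star @ np.linalg.solve(K, X_next)
    var = k_ss - k_star @ np.linalg.solve(K, k_star)
    return mean, var
```

Near the training inputs the predictive variance shrinks toward the noise
level, while far from the data the prediction falls back to the prior
(zero mean, unit variance for this kernel).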

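As with GMR, propagation of uncertainty can be approximately
modeled by GPs. If we assume that
$\mathbf{z}_t = [\mathbf{x}_t^{\top}, \mathbf{u}_t^{\top}]^{\top}$ is drawn from a
Gaussian distribution $p(\mathbf{z}_t \mid \boldsymbol{\mu}_t, \boldsymbol{\Sigma}_t)$,
the predictive distribution of the state at time $t+1$ is given by

$$
p(\mathbf{x}_{t+1} \mid \boldsymbol{\mu}_t, \boldsymbol{\Sigma}_t)
= \int p(f(\mathbf{z}_t) \mid \mathbf{z}_t, \mathcal{D})\, p(\mathbf{z}_t)\, d\mathbf{z}_t,
\tag{3.75}
$$

where $p(f(\mathbf{z}_t) \mid \mathbf{z}_t, \mathcal{D})$ is a Gaussian
distribution given by (3.74). The marginal distribution
$p(\mathbf{x}_{t+1} \mid \boldsymbol{\mu}_t, \boldsymbol{\Sigma}_t)$ in (3.75) can
be approximated by a Gaussian distribution by following the results from
[Deisenroth and Rasmussen, 2011, Deisenroth et al., 2013a].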

Algorithm 12 Probabilistic model-based imitation learning [Englert
et al., 2013]

    Input: n trajectories $\tau_i$ demonstrated by the expert
    Estimate the expert distribution over trajectories $q(\tau^{\mathrm{demo}})$
    Record state-action pairs of the robot through applying random
    control inputs
    repeat
        Learn/update probabilistic GP forward model
        Predict the new trajectory distribution $p(\tau)$
        Learn policy $\pi^{*} = \arg\min_{\pi} D_{\mathrm{KL}}(q(\tau^{\mathrm{demo}}) \,\|\, p(\tau))$
        Apply $\pi^{*}$ to the system and record data
    until task learned


Englert et al. [2013] used GPs for predicting the trajectory distribution,
and the KL divergence was used to evaluate the similarity of the
demonstrated and learned behaviors. Englert et al. [2013] modeled
trajectories as a Gaussian distribution

$$
p(\tau) = \prod_{t=1}^{T} p(\mathbf{x}(t))
        = \prod_{t=1}^{T} \mathcal{N}(\mathbf{x}(t) \mid \boldsymbol{\mu}(t), \boldsymbol{\Sigma}(t)).
\tag{3.76}
$$

For two given Gaussian distributions
$p(\mathbf{x}(t)) \sim \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_p(t), \boldsymbol{\Sigma}_p(t))$ and
$q(\mathbf{x}(t)) \sim \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_q(t), \boldsymbol{\Sigma}_q(t))$,
the KL divergence of $q$ and $p$ can be computed in closed form. Using the
factorization in (3.76), the KL divergence between the trajectory
distribution induced by the expert policy $q(\tau)$ and the trajectory
distribution induced by the learned policy $p(\tau)$ can be computed as

$$
D_{\mathrm{KL}}(q(\tau) \,\|\, p(\tau)) = \sum_{t=1}^{T} D_{\mathrm{KL}}\left( q(\mathbf{x}(t)) \,\|\, p(\mathbf{x}(t)) \right).
\tag{3.77}
$$
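
The closed-form divergence behind (3.77) can be sketched as follows. The
helper names `kl_gaussians` and `trajectory_kl` are illustrative choices, and
the per-time-step means and covariances are assumed to be given as sequences
of NumPy arrays; this is the standard Gaussian KL formula, not code from
Englert et al. [2013].

```python
import numpy as np


def kl_gaussians(mu_q, S_q, mu_p, S_p):
    """Closed-form KL(q || p) for q = N(mu_q, S_q) and p = N(mu_p, S_p)."""
    d = mu_q.shape[0]
    S_p_inv = np.linalg.inv(S_p)
    diff = mu_p - mu_q
    _, logdet_p = np.linalg.slogdet(S_p)
    _, logdet_q = np.linalg.slogdet(S_q)
    return 0.5 * (np.trace(S_p_inv @ S_q) + diff @ S_p_inv @ diff
                  - d + logdet_p - logdet_q)


def trajectory_kl(mus_q, Ss_q, mus_p, Ss_p):
    """Sum of per-time-step KL terms, following the factorization in Eq. (3.77)."""
    return sum(kl_gaussians(mq, Sq, mp, Sp)
               for mq, Sq, mp, Sp in zip(mus_q, Ss_q, mus_p, Ss_p))
```

For example, two unit-variance scalar Gaussians whose means differ by one
give a divergence of 0.5 per time step, so a two-step trajectory yields 1.0.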


Englert et al. [2013] used this KL divergence to define the objective
function to be minimized as
$\mathcal{L}_{\mathrm{KL}} = D_{\mathrm{KL}}(q(\tau) \,\|\, p(\tau))$. To minimize
$\mathcal{L}_{\mathrm{KL}}$, we can compute the gradient analytically and use
gradient descent [Deisenroth, 2010, Deisenroth and Rasmussen, 2011].
Algorithm 12 summarizes the procedure in [Englert et al., 2013]. The
method in [Englert et al., 2013] matches the first and second moments of
the trajectory distribution through iterative learning. Since the
derivatives can be analytically computed when using a GP forward dynamics
model, imitation learning can be performed efficiently.


Algorithm 13 Iterative control learning [van den Berg et al., 2010]

    Input: desired trajectory $\tau^{d}$, learning rate $\alpha$
    Initialize the target trajectory as $\hat{\tau} = \tau^{d}$
    repeat
        Execute a controller with the target trajectory $\hat{\tau}$
        Record the executed trajectory $\tau$
        Update the target trajectory $\hat{\tau} \leftarrow \hat{\tau} - \alpha(\tau - \tau^{d})$
    until $\tau \approx \tau^{d}$


3.7.2 Imitation Learning through Iterative Learning Control 


In order to develop a controller to achieve the desired trajectory, we
can also use an iterative learning control approach without a forward
dynamics model. Abbeel et al. [2010] and van den Berg et al. [2010] learn
a controller iteratively to reproduce a desired trajectory.

While van den Berg et al. [2010] use a Linear Quadratic Regulator
(LQR) [Anderson and Moore, 1990] for optimal control, the method is
not limited to a specific controller. Algorithm 13 shows how iterative
control learning in [van den Berg et al., 2010] works. Given a desired
trajectory $\tau^{d}$, LQR control is performed to track the target
trajectory $\hat{\tau}$. In the initial step, the target trajectory is
initialized as $\hat{\tau} = \tau^{d}$. When the executed trajectory $\tau$
deviates from the desired trajectory $\tau^{d}$, the approach updates the
target trajectory as $\hat{\tau} \leftarrow \hat{\tau} - \alpha(\tau - \tau^{d})$,
where $\alpha$ is the learning rate. By repeating this execution and update,
we obtain a target trajectory $\hat{\tau}$ whose execution yields
$\tau \approx \tau^{d}$. Although this method is simple and easy to
implement, the controller cannot be generalized to different desired
trajectories.
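
The update rule of Algorithm 13 can be illustrated on a toy problem. In the
sketch below, the "plant plus tracking controller" is a hypothetical stand-in
(`execute`) that follows the target trajectory with a fixed offset; it is not
the LQR controller of van den Berg et al. [2010], and the default learning
rate, tolerance, and iteration cap are arbitrary choices for the example.

```python
import numpy as np


def execute(target, bias):
    """Hypothetical plant + controller: tracks the target with a fixed offset."""
    return target + bias


def iterative_control_learning(desired, bias, alpha=0.5, tol=1e-6, max_iters=100):
    """Shift the target trajectory until the executed one matches the desired one."""
    target = desired.copy()                    # initialize target as the desired trajectory
    for _ in range(max_iters):
        executed = execute(target, bias)       # run the controller on the current target
        error = executed - desired
        if np.max(np.abs(error)) < tol:        # stop once executed ~ desired
            break
        target = target - alpha * error        # update rule of Algorithm 13
    return target, executed
```

In this toy setting the tracking error shrinks by a factor of $(1 - \alpha)$
per iteration, and the learned target ends up offset from the desired
trajectory by the negative of the systematic tracking error.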

When a given system is fully controllable, we can learn forward and
inverse dynamics of the system. As indicated by Nguyen-Tuong and
Peters [2011], various methods have been developed for model learning.
However, it is often challenging to apply such approaches to systems that
are not fully controllable. Iterative LQR (iLQR) is often employed to
control a system whose dynamics are not accurately known [Todorov
and Li, 2005, Abbeel et al., 2010, Tassa et al., 2012]. iLQR learns a
linear feedback controller to follow a trajectory through an iterative
learning process. Abbeel et al. [2010] learn trajectories for acrobatic RC
helicopter flights from experts' demonstrations and utilize iLQR to
reproduce the desired trajectory.


Figure 3.12: Schematic illustration of model-based BC methods. Model-based BC
methods often iterate between policy updates and task execution so as to match the
expected features as $\mathbb{E}_p[\boldsymbol{\phi}] \approx \mathbb{E}_q[\boldsymbol{\phi}]$.


3.7.3 Information Theoretic Understandings of Model- 
Based Behavioral Cloning Methods 


BC methods with forward dynamics models such as [Englert et al.,
2013, Grimes and Rao, 2009] iteratively evaluate the learned policy
$\pi(\mathbf{u} \mid \mathbf{x})$ in order to reproduce trajectories close to the
demonstrations. These methods evaluate the trajectory under the
distribution induced by the learned policy and match its expected features
with those of the expert demonstrations. This approach can be interpreted
as a process to empirically learn the policy $\pi(\mathbf{u} \mid \mathbf{x})$
that satisfies

$$
\mathbb{E}_p[\boldsymbol{\phi}(\tau)] = \mathbb{E}_q[\boldsymbol{\phi}(\tau)],
\tag{3.78}
$$

where $q(\tau)$ is the expert trajectory distribution and $p(\tau)$ is the
trajectory distribution induced by the learner's policy. The learning
process of BC methods with forward dynamics is illustrated in Figure 3.12.
In addition, the method in [Englert et al., 2013] assumes that the
trajectory distribution is Gaussian. As Park and Bera [2009] indicated,
the Gaussian distribution is one of the maximum entropy distributions.
Therefore, matching the feature expectation as in (3.78) under the
assumption of the Gaussian distribution can be interpreted as the
M-projection onto the manifold of the maximum entropy distribution, as
we discussed in §2.7.1.


3.8 Robot Applications with Model-Free BC Methods 


In this section, we show several examples of model-free behavioral cloning
(BC) in robotic applications to demonstrate the capability of model-free
BC methods. Model-free BC methods have been utilized successfully in
various applications, including autonomous RC helicopter flight,
ball-hitting tasks, and robotic surgery. Abbeel et al. [2010] use
an iterative LQR controller in acrobatic helicopter flight to control the
nonlinear system. Osa et al. [2017b] perform knot-tying tasks using a
standard PD controller on a surgical robot. From the following
application examples, one can see that different applications require
different controllers and learning methods.


3.8.1 Learning to Hit a Ball with DMP 


Hitting a ball is a typical example of a task that can be learned as
a point-to-point motion. Ijspeert et al. [2002b] showed that a tennis
swing can be learned with DMPs. The motion of a tennis swing was
demonstrated by a human, and the motion was recorded using a motion
capture suit, which can mechanically measure the joint angles of
35 DoFs of the human body at 100 Hz. The recorded motion was
reproduced on a humanoid robot with 30 DoFs. To accurately reproduce
the trajectories, an inverse dynamics controller was employed in this
experiment. The experimental results showed that the learned motion
generalized to different target positions.

Figure 3.13: Learning rhythmic motions for the Ball-Paddling task in [Kober and
Peters, 2009]. Kober and Peters [2009] used kinesthetic teaching to demonstrate
periodic hitting motions in Ball-Paddling and trained rhythmic DMPs to reproduce
the demonstrated periodic movements.


Kober and Peters [2009] learned the Ball-Paddling task shown in
Figure 3.13 from demonstrations. The goal of this task is to keep the ball
bouncing repeatedly. Kober and Peters [2009] used the
seven-degree-of-freedom Barrett WAM arm to demonstrate trajectories using
kinesthetic teaching and learned the periodic motion using rhythmic DMPs.
In the experiments, ten basis functions per motor primitive were used to
represent the task.


3.8.2 Learning Hand-Over Tasks with ProMPs 


Motion planning in the context of human-robot collaboration often
requires learning the coupled motions of a human operator and a robot.
Maeda et al. [2016] show that the correlation of the two agents' motions
can be modeled using ProMPs. Maeda et al. [2016] illustrate the
approach in a hand-over motion: when a human extends their hand to
receive a plate or screw, the robot grasps it and gives it to the human
operator. Maeda et al. [2016] used a KUKA LWR robot and kinesthetic
teaching for demonstrating tasks, and the motion of a human
operator was tracked using a 3D optical tracking system. The task
was demonstrated 13-20 times. Demonstrated trajectories are shown
in Figure 3.14. The correlation of the robot's motion and the human
operator's motion was learned with interaction ProMPs, an extension
of the ProMPs proposed by Paraschos et al. [2013]. To achieve
the human-robot collaborative task, the robot motion was planned by
conditioning the learned distribution on the observed motion of the
human operator. Maeda et al. [2016] applied interaction ProMPs to
several tasks as shown in Figure 3.14. The study by Maeda et al. [2016]
showed that the reactive motions of the robot were successfully planned
based on the observed motions of the human operator.

Figure 3.14: Learning human-robot collaborative motions in [Maeda et al., 2016].
Maeda et al. [2016] used kinesthetic teaching to demonstrate coupled movements,
where both the human and robot need to move to perform a task. The demonstrations
were used to train interaction ProMPs, which take correlations between human
and robot movement into account: the robot motion can be planned as a conditional
distribution given the human movement. The pictures show how the robot is able
to adapt its movement in several tasks.
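
The conditioning step described above can be illustrated with a generic
Gaussian update. The sketch below is a simplification and not the
interaction ProMP implementation of Maeda et al. [2016]: it assumes a joint
Gaussian directly over stacked human and robot variables (rather than over
basis-function weights), and the function name `condition_on_human` and the
observation-noise parameter are choices made for this example.

```python
import numpy as np


def condition_on_human(mu, Sigma, n_human, human_obs, obs_noise=1e-4):
    """Condition a joint Gaussian over [human; robot] on an observed human part.

    mu, Sigma: mean and covariance of the stacked variables (assumed learned
    from demonstrations); the first n_human entries belong to the human.
    Returns the mean and covariance of the robot part after conditioning.
    """
    n = mu.shape[0]
    # H selects the human block from the stacked vector.
    H = np.hstack([np.eye(n_human), np.zeros((n_human, n - n_human))])
    S_obs = H @ Sigma @ H.T + obs_noise * np.eye(n_human)
    gain = Sigma @ H.T @ np.linalg.inv(S_obs)
    mu_new = mu + gain @ (human_obs - H @ mu)
    Sigma_new = Sigma - gain @ H @ Sigma
    return mu_new[n_human:], Sigma_new[n_human:, n_human:]
```

The stronger the learned correlation between the human and robot blocks, the
more the observed human motion shifts the robot's mean and reduces its
variance.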

Recent work by Lioutikov et al. [2017] proposed a method for segmenting
demonstrated trajectories in a probabilistic manner and learning
a sequence of movement primitives represented by ProMPs. Tasks
that emulate table tennis, writing and chair assembly are reported in
[Lioutikov et al., 2017].

(a) Slave manipulator (b) Visualization of planned trajectories


Figure 3.15: Autonomous knot-tying with a surgical robot [Osa et al., 2017b]. Left:
Bimanual manipulation tasks were learned using a model-free BC method. Right:
The trajectories can be updated in real time when the context is changing during
task execution. The demonstration was performed under various contexts, and the
trajectory distribution was modeled using a Gaussian Process. A force controller
was built as an outer loop of the standard PD position controller.


3.8.3 Learning to Tie a Knot by Modeling the Trajectory 
Distribution with Gaussian Processes 


Knot-tying in robotic surgery is one of the tasks that are hard to learn
as a sequence of point-to-point motions. In the looping motion required
for the knot-tying task, the topological shape of the entire trajectory
is critical, although the start and goal positions of the trajectory are
not critical to the success of the task. Osa et al. [2017b] applied a
behavioral cloning method to this knot-tying task as shown in
Figure 3.15. Osa et al. [2017b] learned a conditional distribution of the
demonstrated trajectories given the context as a Gaussian Process, which
allows the demonstrated trajectories to be generalized to a new context in
real time. Additionally, the learned trajectory distribution was used to
plan and control the contact force between the surgical instruments and
objects. Osa et al. [2014] employed Algorithm 6 for normalizing the time
alignment of multiple demonstrated trajectories.

In experiments with a bimanual teleoperated master-slave system
for robotic surgery shown in Figure 3.15, the system performed tasks
that emulate tying a knot and cutting soft tissues. The task was
demonstrated 9-20 times under various contexts. The experimental results
show that the trajectories can be updated in real time.

Figure 3.16: Learning autonomous helicopter maneuvers from expert
demonstrations in [Abbeel et al., 2010]. Acrobatic flights were learned in a system that involves
highly nonlinear dynamics. An iterative LQR controller is employed to execute the
trajectory learned from demonstrations.


3.9 Robot Applications with Model-Based Behavioral 
Cloning Methods 


We present applications of model-based BC methods in this section. 
Model-based BC methods can be used to control robotic systems with 
nonlinear dynamics. A remarkable application example of model-based 
BC methods is acrobatic helicopter flights [Abbeel et al., 2010]. Addi- 
tionally, we discuss an application for learning from different embodi- 
ments. Subsequently, we show applications of planning in action-state 
space. 


3.9.1 Learning Acrobatic Helicopter Flights 


Autonomous flight of an RC helicopter involves nonlinear dynamics,
making helicopter control non-trivial. Abbeel et al. [2010] showed how
to learn acrobatic RC helicopter flight from experts' demonstrations.
For modeling time-dependent trajectories, Abbeel et al. [2010]
normalize the temporal alignment of the demonstrated trajectories using
an Expectation Maximization (EM)-like method, which we discussed
in §3.5.5. Abbeel et al. [2010] learn acrobatic flight trajectories
using a model-based behavioral cloning method. Due to the challenge
of controlling the highly nonlinear helicopter dynamics, Abbeel et al.
[2010] use an iterative LQR controller. In the experiments, the
helicopter control system performs various maneuvers including in-place
flips, in-place rolls, loops and hurricanes, and even auto-rotation
landings, chaos and tic-toc. Figure 3.16 shows a snapshot of the acrobatic
flight reported in [Abbeel et al., 2010]. Previously, these acrobatic
maneuvers could only be performed by exceptional experts, but Abbeel
et al. [2010] showed that such expert skills can be transferred to a
robotic system by combining model-based BC and iterative controller
learning.


Figure 3.17: Learning to hit a ball with an underactuated manipulator in
[Englert et al., 2013]. Englert et al. [2013] learned a forward model of the system
using Gaussian Processes. Together with the forward model, Englert et al. [2013]
used PILCO [Deisenroth and Rasmussen, 2011, Deisenroth et al., 2013a] as the
reinforcement learning method to train a policy to reproduce
demonstrated trajectories.


3.9.2 Learning to Hit a Ball with an Underactuated Robot 


Learning tasks with an underactuated robot is challenging since
feasible trajectories are limited. Englert et al. [2013] learned ball hitting
with an underactuated robot using a model-based imitation learning
method. In the experiments, the trajectories were demonstrated by
kinesthetic teaching, and the trajectory and the controller to achieve
the task were learned from demonstrations. The BioRob™ robot [Lens et al.,
2010], an underactuated and compliant manipulator,
was used in the experiments. Figure 3.17 shows a task with the
underactuated manipulator reported in [Englert et al., 2013]. Since Englert
et al. [2013] learn a robot-specific controller, the controller is more
robust to the correspondence problem than model-free behavioral
cloning methods. Learning a robot-specific policy is one of the benefits
of model-based imitation learning. Although developing a controller for
an underactuated robot with unknown nonlinear dynamics is not trivial,
model-based behavioral cloning methods can address this problem
by exploiting the learned forward dynamics model. This method
requires an iterative learning process to obtain a policy that reproduces
the expert's trajectory.


Figure 3.18: Applications of DAGGER [Ross et al., 2011]. Left: Learning to play
a video game [Ross et al., 2011]. Right: Learning autonomous UAV flight [Ross
et al., 2013]. The UAV flew autonomously in real forest environments. In DAGGER,
the learner complements initial demonstrations by querying an expert online for
demonstrations specifically for states induced by the learner's policy.


3.9.3 Learning to Control with DAGGER 


Ross et al. [2011] demonstrated how the DAGGER algorithm learns to
play a video game as shown in Figure 3.18. Visual features of 2D images
were used as the system state, and a policy linear in the visual features
was learned using DAGGER. A human expert demonstrated the correct
steering for observed game images. DAGGER has also been applied to
control UAVs as shown in Figure 3.18 [Ross et al., 2013]. Ross et al.
[2013] trained a controller that can avoid trees in natural environments
using a small set of human demonstrations and performed autonomous
flights in a real forest. In both examples, a small error at an early
time step may lead the learner to an unseen state which largely deviates
from expert demonstrations. Since the learner encounters various states in
which the expert did not demonstrate how to act, an online learning
approach such as DAGGER is essential in these applications.
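
The data-aggregation idea can be sketched on a one-dimensional toy system.
Everything in this example is an assumption made for illustration: the
expert is a hypothetical linear feedback law, the learner is a linear policy
fitted by least squares, and the dynamics are $x' = x + u$. It is a
simplified variant of the loop in Ross et al. [2011] (for instance, there is
no mixing of expert and learner actions during rollouts), not their
implementation.

```python
import numpy as np


def expert_action(x):
    """Hypothetical expert: a stabilizing linear feedback law."""
    return -0.8 * x


def rollout(gain, x0=1.0, horizon=20):
    """States visited when the learner's policy u = gain * x controls x' = x + u."""
    states, x = [], x0
    for _ in range(horizon):
        states.append(x)
        x = x + gain * x
    return np.array(states)


def dagger(n_iters=5):
    states = np.array([1.0, 0.5])              # states from initial expert demonstrations
    actions = expert_action(states)            # expert actions at those states
    gain = 0.0
    for _ in range(n_iters):
        gain = (states @ actions) / (states @ states)   # least-squares fit of u = gain * x
        visited = rollout(gain)                          # states induced by the learner
        states = np.concatenate([states, visited])       # aggregate the dataset ...
        actions = np.concatenate([actions, expert_action(visited)])  # ... with expert labels
    return gain
```

Because the expert here is itself linear, the learner recovers the expert's
gain immediately; the point of the sketch is only the structure of the loop,
in which the expert labels the states that the learner's own policy visits.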


4


Inverse Reinforcement Learning


In inverse reinforcement learning (IRL) [Russell, 1998], also called
inverse optimal control [Kalman, 1964, Moylan and Anderson, 1973,
Dvijotham and Todorov, 2010, Levine and Koltun, 2012], inverse planning
[Baker et al., 2009], or structural estimation of MDPs [Rust, 1994], the
learner tries to recover a reward function from a policy (or
demonstrations of a policy). Recovering the reward function can be beneficial
when the reward function is the most parsimonious way to describe the
desired behavior.

We begin our discussion of inverse reinforcement learning (IRL) with
a definition of IRL in §4.1, discuss the critical assumption of linear
vs. nonlinear reward functions in §4.2, continue with model-based IRL
methods in §4.4 and model-free IRL methods in §4.5, give an information
theoretic interpretation of IRL methods in §4.6, show how partial
observability affects IRL in §4.7, and finally finish with applications of
IRL in §4.8.


4.1 Problem Statement


Russell defines the problem of IRL [Russell, 1998] as follows: 


Given 1) measurements of an agent’s behavior over time, in a 
variety of circumstances, 2) measurements of the sensory inputs to 
that agent; 3) a model of the physical environment (including the 
agent’s body). 

Determine the reward function that the agent is optimizing. 


A common assumption in IRL is that the demonstrator utilizes
a Markov decision process (MDP) for decision making. Formally, an
MDP is a tuple $(\mathcal{X}, \mathcal{U}, P, \gamma, D, R)$. $\mathcal{X}$ is a
finite set of states; $\mathcal{U}$ is a set of control inputs; $P$ is a set of
state transition probabilities; $\gamma \in [0, 1)$ is a discount factor; $D$ is
the initial-state distribution from which the initial state $\mathbf{x}_0$ is
drawn; and $R : \mathcal{X} \to \mathbb{R}$ is the reward function. In
addition, many IRL methods assume that there is a vector of features
$\boldsymbol{\phi} : \mathcal{X} \to [0, 1]^{k}$. IRL methods often estimate the
reward function as a function of these features $\boldsymbol{\phi}$.

The goal of IRL is to recover the unknown reward function $R(\tau)$
from the expert's trajectories. However, since a policy can be optimal
for multiple reward functions, the problem of determining the reward
function is "ill-posed". To obtain a unique solution in IRL, many
studies have proposed additional objective functions to be optimized,
such as maximizing the margin between the optimal policy and others
[Ng and Russell, 2000, Abbeel and Ng, 2004, Ratliff et al., 2006b,a, 2009,
Silver et al., 2010] and maximizing the entropy [Ziebart et al., 2008,
Ziebart, 2010, Kitani et al., 2012, Shiarlis et al., 2016].

Many IRL methods require an iterative learning process (although
see Ratliff et al. [2006b] for a description directly in terms of a
quadratic program). Algorithm 14 summarizes a class of IRL methods
that proceed by alternately solving an RL-style problem and updating
a cost function estimate. In order to obtain performance equivalent
to the expert's policy, the state-action visitation frequency $\mu$ needs
to be matched between demonstrated trajectories and the trajectories



Algorithm 14 Abstract version of feature matching inverse
reinforcement learning

    Input: Expert trajectories $\mathcal{D} = \{\tau_i\}$
    Initialize the reward function and policy parameters $\mathbf{w}$, $\boldsymbol{\theta}$
    repeat
        Evaluate the state-action visitation frequency $\mu$ of the current
        policy $\pi_{\boldsymbol{\theta}}$
        Evaluate the objective function $\mathcal{L}$ and its derivative
        $\nabla_{\mathbf{w}} \mathcal{L}$ by comparing $\mu$ and the state-action
        distribution implied by $\mathcal{D}$
        Update the reward function parameter $\mathbf{w}$
        Update the policy parameter $\boldsymbol{\theta}$ with a reinforcement
        learning method
    until
    return optimized policy parameters $\boldsymbol{\theta}$ and reward function
    parameter $\mathbf{w}$


induced by the learner's policy, as indicated by Abbeel and Ng [2004] and
Ho and Ermon [2016]. The reward function parameter $\mathbf{w}$ is updated
through optimizing the objective function under the expected feature
matching constraint. This objective function is designed to estimate the
reward function which makes the demonstrations appear more optimal
than the current policy. The policy parameters $\boldsymbol{\theta}$ are then
updated using an optimal control solution (i.e., a reinforcement learning
method) based on the current estimate of the reward function. For this
purpose, inverse reinforcement learning methods often have an RL-style
procedure in an inner loop. By repeating this process, the policy and reward
function parameters can be obtained.

Each IRL method has a different way of performing these steps.
Model-based methods require knowledge of the system dynamics in
order to evaluate the state-action visitation frequency. In contrast,
model-free methods often employ sampling-based methods for
this purpose. In order to obtain an optimal policy based on the
recovered reward function, various reinforcement learning methods can be
used. Although MDP solvers can be used for the policy optimization
in discrete state-action spaces as in [Abbeel and Ng, 2004, Ratliff et al.,
2006a], recent policy search methods can also be used. For example,
Finn et al. [2016b] employed guided policy search [Levine and Abbeel,
2014], and Ho and Ermon [2016] and Ho et al. [2016] employed trust
region policy optimization [Schulman et al., 2015].


4.2 Model-Based and Model-Free Inverse 
Reinforcement Learning Methods 


As with behavioral cloning methods, IRL methods can be divided
into two categories: model-based and model-free methods. Model-based
IRL methods assume that the dynamics of the system, e.g. state
transition probabilities, are known. The prior knowledge of the system
dynamics is often used to evaluate and update the learned reward function
and policy. These model-based IRL methods are relatively simple to
implement when the system dynamics are known. However, it is challenging
to apply model-based IRL methods to applications with nonlinear
dynamics, which are hard to estimate. On the other hand, model-free
IRL methods do not require prior knowledge of the system dynamics.
Model-free IRL methods evaluate and update the learned reward function
and policy using sampling-based methods, which can be applied
to systems with nonlinear dynamics. However, it is necessary to sample
many trajectories to estimate the trajectory distribution, which can be
time-consuming and computationally expensive. Table 4.1 summarizes
the advantages and disadvantages of model-free and model-based IRL
methods.
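
The contrast between the two families can be made concrete for the
evaluation of discounted state visitation frequencies in a small tabular
problem. The sketch below is illustrative only: `P_pi` is an assumed
state-transition matrix under a fixed policy, `d0` an assumed initial-state
distribution, and the default rollout count and horizon are arbitrary. The
model-based variant solves a linear system using the known dynamics, whereas
the sampling-based variant only draws trajectories (here the same matrix
plays the role of the environment being sampled).

```python
import numpy as np


def exact_visitation(P_pi, d0, gamma):
    """Model-based: discounted state visitation mu^T = d0^T (I - gamma P_pi)^{-1}."""
    n = P_pi.shape[0]
    return np.linalg.solve((np.eye(n) - gamma * P_pi).T, d0)


def sampled_visitation(P_pi, d0, gamma, n_rollouts=1000, horizon=60, seed=0):
    """Sampling-based: Monte Carlo estimate from truncated rollouts."""
    rng = np.random.default_rng(seed)
    n = P_pi.shape[0]
    mu = np.zeros(n)
    for _ in range(n_rollouts):
        s = rng.choice(n, p=d0)                 # draw the initial state
        for t in range(horizon):
            mu[s] += gamma**t                   # accumulate the discounted visit
            s = rng.choice(n, p=P_pi[s])        # sample the next state
    return mu / n_rollouts
```

The exact solution needs the transition probabilities but no rollouts, while
the sampled estimate needs many trajectories to reduce its variance, which
mirrors the trade-off summarized in Table 4.1.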


4.3 Design Choices for Inverse Reinforcement Learning 
Methods 


In addition to the design choices we described in Chapter 2, there are
IRL-specific design choices:


1. What objective should be used to obtain a unique solution
in IRL? As discussed in §4.1, IRL itself is an ill-posed
problem, and it is necessary to design the objective function so
as to obtain a unique solution. Table 4.2 summarizes
different objectives for learning reward functions. As shown, the
maximum entropy principle is a popular choice in recent studies
on IRL, although the concept of maximizing the margin between
the optimal policy and others was popular in the early studies on
IRL. The maximum entropy principle is well-founded in information
theory, and we review the related IRL methods in §4.4.3.


2. Should the reward function be linear or nonlinear in
the features? Although many IRL methods employ a reward
function linear in the features, complex tasks in robotics require
a nonlinear reward function. On the other hand, IRL with a
reward function nonlinear in the features is more challenging
than IRL with a linear reward function. Therefore, we need
to consider the most parsimonious representation of the reward
function among sufficiently expressive ones.


Table 4.3 shows a categorization of the existing IRL methods. As
one can see, many IRL methods are model-based and use a
linear reward function. In contrast, model-free methods with
nonlinear reward functions have not been investigated well.


Table 4.1: Advantages and disadvantages of model-based and model-free methods
in inverse reinforcement learning. Model-based IRL methods can be more
data-efficient compared to model-free methods. However, it is challenging to apply
model-based IRL methods to systems with nonlinear dynamics. Model-free IRL methods
have been applied to systems with nonlinear dynamics.

               | Model-free                      | Model-based
---------------+---------------------------------+--------------------------------
Advantages     | Applicable to systems with      | Estimation of the trajectory
               | nonlinear and unknown dynamics. | distribution is data-efficient.
Disadvantages  | It is necessary to sample many  | Model learning can be very
               | trajectories to estimate the    | difficult. It is hard to apply
               | trajectory distribution.        | to underactuated systems.


Table 4.2: Objectives to obtain a unique solution in inverse reinforcement
learning. The concept of maximizing the margin between the optimal policy and others
was popular in the early studies on IRL. The maximum entropy principle is a
dominant choice for recent IRL methods.

Objectives       | Employed by
-----------------+-----------------------------------------------------------
Maximum margin   | [Ng and Russell, 2000, Abbeel and Ng, 2004,
                 |  Ratliff et al., 2006b,a, 2009, Silver et al., 2010,
                 |  Zucker et al., 2011]
Maximum entropy  | [Ziebart et al., 2008, Ramachandran and Amir,
                 |  2007, Choi and Kim, 2011b, Ziebart, 2010,
                 |  Boularias et al., 2011, Kitani et al., 2012,
                 |  Shiarlis et al., 2016, Ho and Ermon, 2016, Finn
                 |  et al., 2016b]
Other            | [Doerr et al., 2015, Arenz et al., 2016]


In the next section, we review model-based IRL methods, and there- 
after, we review model-free IRL methods. 


4.4 Model-Based Inverse Reinforcement Learning Methods


In this section, we review model-based IRL methods, which leverage 
prior knowledge about system dynamics. 


Table 4.3: Categorization of existing inverse reinforcement learning methods.
However, tasks such as manipulation in robotic applications require a nonlinear
reward function.

                  | Model-free                   | Model-based
------------------+------------------------------+------------------------------
Linear reward     | [Boularias et al., 2011,     | [Abbeel and Ng, 2004,
                  |  Kalakrishnan et al., 2013]  |  Ratliff et al., 2006b, Silver
                  |                              |  et al., 2010, Ramachandran
                  |                              |  and Amir, 2007, Choi and
                  |                              |  Kim, 2011b, Ziebart et al.,
                  |                              |  2008, Ziebart, 2010, Levine
                  |                              |  and Koltun, 2012,
                  |                              |  Hadfield-Menell et al., 2016]
Nonlinear reward  | [Finn et al., 2016b, Ho and  | [Ratliff et al., 2006a, 2009,
                  |  Ermon, 2016]                |  Silver et al., 2010, Grubb
                  |                              |  and Bagnell, 2010, Levine
                  |                              |  et al., 2011]


Algorithm 15 IRL by expected feature matching [Abbeel and Ng,
2004]

    Input: Dataset of the demonstrations $\mathcal{D} = \{(\mathbf{x}_i, \mathbf{u}_i)\}_{i=1}^{N}$,
    termination threshold $\epsilon$
    Randomly pick some policy $\pi^{0}$
    Compute $\boldsymbol{\mu}^{E}$ using $\mathcal{D}$
    Perform rollouts and compute $\boldsymbol{\mu}^{0} = \boldsymbol{\mu}(\pi^{0})$
    Set $i = 1$
    repeat
        Compute $t = \max_{\mathbf{w} : \|\mathbf{w}\|_2 \le 1} \min_{j \in \{0, \ldots, i-1\}} \mathbf{w}^{\top} (\boldsymbol{\mu}^{E} - \boldsymbol{\mu}^{j})$
        Compute the optimal policy $\pi^{i}$ based on $r(\mathbf{x}) = \mathbf{w}^{\top} \boldsymbol{\phi}(\mathbf{x})$
        Compute $\boldsymbol{\mu}^{i} = \boldsymbol{\mu}(\pi^{i})$
        Set $i = i + 1$
    until $t \le \epsilon$
    return $\pi^{i} : i = 0, \ldots, n$


4.4.1 Feature Expectation Matching 


Abbeel and Ng [2004] proposed to match the feature expectation in
order to solve IRL problems. If we assume the reward function is linear
w.r.t. the features, the reward function is given by

$$
r(\mathbf{x}) = \mathbf{w}^{\top} \boldsymbol{\phi}(\mathbf{x}),
\tag{4.1}
$$

where $\boldsymbol{\phi}(\mathbf{x})$ is the feature vector of the state
$\mathbf{x}$, and $\mathbf{w}$ is a weight vector. Therefore, the expected
reward of a policy $\pi$ is given by

$$
\mathbb{E}[R \mid \pi]
= \mathbb{E}\left[ \sum_{t=0}^{T} \gamma^{t} r(\mathbf{x}_t) \,\middle|\, \pi \right]
= \mathbb{E}\left[ \sum_{t=0}^{T} \gamma^{t} \mathbf{w}^{\top} \boldsymbol{\phi}(\mathbf{x}_t) \,\middle|\, \pi \right]
= \mathbf{w}^{\top} \mathbb{E}\left[ \sum_{t=0}^{T} \gamma^{t} \boldsymbol{\phi}(\mathbf{x}_t) \,\middle|\, \pi \right].
\tag{4.2}
$$

Abbeel and Ng [2004] defined the feature expectation of a policy $\pi$ as

$$
\boldsymbol{\mu}(\pi) = \mathbb{E}\left[ \sum_{t=0}^{T} \gamma^{t} \boldsymbol{\phi}(\mathbf{x}_t) \,\middle|\, \pi \right] \in \mathbb{R}^{k}.
\tag{4.3}
$$

Using this notation, the value of a policy can be rewritten as

$$
\mathbb{E}[R \mid \pi] = \mathbf{w}^{\top} \boldsymbol{\mu}(\pi),
\tag{4.4}
$$

where $R = \sum_{t=0}^{T} \gamma^{t} r(\mathbf{x}_t)$. Based on this matching of
the feature expectation, Abbeel and Ng [2004] proposed to learn the policy
from demonstrations so as to maximize the difference between the optimal
policy and others. Maximization of the difference between the optimal
policy and others was formulated as a quadratic program. By
iteratively updating the learned policy, the algorithm finds the optimal
policy close to the demonstrated policy. Algorithm 15 summarizes the
method in [Abbeel and Ng, 2004].

The matching feature expectation appears also in other IRL meth- 
ods, such as Ziebart et al. [2008]. However, matching the expected 
feature count is ambiguous since multiple policies can achieve the same 
expected feature counts. Therefore, it is necessary to use additional 
conditions that should be satisfied by the optimal policy. 
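
As a rough illustration of Eqs. (4.3) and (4.4), the following Python
sketch estimates a discounted feature expectation from sampled state
sequences and evaluates the value $w^\top \mu(\pi)$ of a linear reward.
The function names, the one-hot feature map, and the toy trajectories
are assumptions made for this example; they are not taken from Abbeel
and Ng [2004].

```python
def feature_expectation(trajectories, phi, gamma):
    """Monte Carlo estimate of mu(pi) = E[sum_t gamma^t phi(x_t)] (Eq. 4.3)."""
    k = len(phi(trajectories[0][0]))
    mu = [0.0] * k
    for states in trajectories:
        for t, x in enumerate(states):
            f = phi(x)
            for d in range(k):
                mu[d] += (gamma ** t) * f[d]
    # Average the discounted feature sums over the sampled trajectories.
    return [m / len(trajectories) for m in mu]


def policy_value(w, mu):
    """Value of a policy under a linear reward: w^T mu(pi) (Eq. 4.4)."""
    return sum(wi * mi for wi, mi in zip(w, mu))


def one_hot(x):
    """Hypothetical feature map: one-hot encoding of a state in {0, 1, 2}."""
    return [1.0 if x == i else 0.0 for i in range(3)]


# Two hypothetical demonstrated state sequences.
expert_trajectories = [[0, 1, 2], [0, 2, 2]]
mu_E = feature_expectation(expert_trajectories, one_hot, gamma=0.5)
value_E = policy_value([0.0, 0.0, 1.0], mu_E)
```

Any policy whose estimated feature expectation is close to `mu_E` has
a similar value under every bounded weight vector, which is the
property the feature matching approach relies on.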


4.4.2 Maximum Margin Planning 


To obtain a unique solution in IRL, Ratliff et al. [2006b] proposed
maximum margin planning (MMP). The idea of MMP is to find the cost
function that maximizes the margin between the demonstrated behavior
and all alternatives. More precisely, MMP finds a cost function under
which the cost of the demonstrated trajectory $C(\tau^{\text{demo}})$
is lower than the cost of any alternative trajectory $C(\tau)$ by a
certain margin. This constraint can be expressed as

$$C(\tau^{\text{demo}}) \leq C(\tau) - L(\tau), \tag{4.5}$$

where $L(\tau)$ is the loss function. The larger the loss $L(\tau)$,
the larger the required cost difference between the demonstrated
trajectory and the trajectory $\tau$. Since we need to consider only
the minimizer of the right-hand side of (4.5), (4.5) can be rewritten
as

$$C(\tau^{\text{demo}}) \leq \min_{\tau} \{C(\tau) - L(\tau)\}. \tag{4.6}$$

In MMP [Ratliff et al., 2006b], the cost function is assumed to be
linear in the features of the trajectory, $C(\tau) = w^\top \phi(\tau)$,
where $w$ is the weight vector and $\phi(\tau)$ are the trajectory
features. If the trajectory features $\phi(\tau)$ are linear in the
state-action frequency counts $\mu \in \mathbb{R}^{|\mathcal{X}||\mathcal{U}|}$,
then $\phi(\tau) = F\mu$, where $F \in \mathbb{R}^{d \times |\mathcal{X}||\mathcal{U}|}$
is the feature matrix and $d$ is the number of features. Likewise, if
the loss function $L(\tau)$ is linear in $\mu$, the loss of the
trajectory is given by $L(\tau) = l^\top \mu$, where
$l \in \mathbb{R}^{|\mathcal{X}||\mathcal{U}|}$ is the loss vector.
Given a training set $\mathcal{D} = \{F_i, \tau_i, l_i\}_{i=1}^{N}$,
the problem of finding $w$ can be formalized as a quadratic program:

$$\min_{w, \zeta} \; \frac{\lambda}{2}\|w\|^2 + \frac{1}{N}\sum_{i=1}^{N} \zeta_i \tag{4.7}$$

$$\text{s.t.} \;\; \forall i, \quad w^\top \phi_i(\tau_i) \leq \min_{\mu} \{w^\top F_i \mu - l_i^\top \mu\} + \zeta_i, \tag{4.8}$$

where $\phi_i(\tau_i) = F_i \mu_i$ and $\mu_i$ denotes the state-action
frequency counts of the demonstrated trajectory $\tau_i$. The slack
variables $\{\zeta_i\}_{i=1}^{N}$ allow violations of the constraints,
in a similar manner as in support vector machines [Vapnik, 1998]. If
we substitute the slack variable
$\zeta_i = w^\top F_i \mu_i - \min_{\mu} \{w^\top F_i \mu - l_i^\top \mu\}$,
the objective function is obtained as

$$L_{\text{MMP}}(w) = \frac{1}{N}\sum_{i=1}^{N}\left(w^\top F_i \mu_i - \min_{\mu} \{w^\top F_i \mu - l_i^\top \mu\}\right) + \frac{\lambda}{2}\|w\|^2, \tag{4.9}$$

which Ratliff et al. [2009] call the maximum margin objective, where
$\lambda > 0$ is the regularization parameter.

To solve this problem, Ratliff et al. [2006b] use a subgradient
method. MMP assumes access to an MDP solver that returns the optimal
trajectory by solving the problem

$$\tau^* = \arg\min_{\tau} C(\tau), \tag{4.10}$$

where $C(\tau)$ is the cumulative cost of the trajectory $\tau$. MMP
uses the loss-augmented cost map $\tilde{C}(\tau) = C(\tau) - L(\tau)$
to plan the trajectory. Algorithm 16 summarizes the procedure of MMP.
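
The subgradient of the maximum margin objective in Eq. (4.9) can be
sketched in a few lines of Python. This is a minimal sketch under
strong assumptions: the MDP solver is replaced by a brute-force search
over a small, explicitly enumerated set of candidate frequency vectors,
and all names (`mmp_subgradient`, `examples`, `candidates`) are
hypothetical rather than part of the method of Ratliff et al. [2006b].

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def features(F, mu):
    """Trajectory features F mu for a feature matrix F given as a list of rows."""
    return [dot(row, mu) for row in F]


def mmp_subgradient(w, examples, lam):
    """Subgradient of Eq. (4.9).

    examples: list of (F, mu_demo, l, candidates); `candidates` stands in
    for the set of trajectories an MDP solver would search over.
    """
    n = len(examples)
    g = [lam * wi for wi in w]  # gradient of the regularizer (lam/2)||w||^2
    for F, mu_demo, l, candidates in examples:
        # Loss-augmented planning: minimize w^T F mu - l^T mu.
        mu_star = min(candidates,
                      key=lambda mu: dot(w, features(F, mu)) - dot(l, mu))
        f_demo, f_star = features(F, mu_demo), features(F, mu_star)
        for d in range(len(w)):
            g[d] += (f_demo[d] - f_star[d]) / n
    return g


# Toy problem: two alternative paths, identity features, unit loss on the
# path that differs from the demonstration.
F = [[1.0, 0.0], [0.0, 1.0]]
examples = [(F, [1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [0.0, 1.0]])]

w = [0.0, 0.0]
g0 = mmp_subgradient(w, examples, lam=0.0)
w = [wi - 1.0 * gi for wi, gi in zip(w, g0)]  # one step with step size 1
g1 = mmp_subgradient(w, examples, lam=0.0)
```

After one step the demonstrated path becomes cheaper than the
alternative by exactly the loss margin, so the loss-augmented planner
returns the demonstration and the subgradient vanishes.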

The MMP framework was extended to LEARCH (LEArning to seaRCH), a
framework for learning nonlinear cost functions efficiently [Ratliff
et al., 2009, Silver et al., 2010, Zucker et al., 2011]. In LEARCH,
exponentiated functional gradient descent is used to optimize the
maximum margin planning objective.

The policy obtained in MMP is based on efficient MDP solvers, which
generate deterministic optimal policies. However, robotic systems with
large configuration space dimensionality often require a stochastic
policy and approximations in planning [Ratliff et al., 2009]. In the
next section, we review maximum entropy IRL by Ziebart et al. [2008],
which considers the distribution of the resulting trajectories.


Algorithm 16 Maximum margin planning [Ratliff et al., 2006b]

Input: Training set $\mathcal{D} = \{F_i, \tau_i, l_i\}_{i=1}^{N}$, regularization parameter $\lambda > 0$, stepsize sequence $\{\alpha_t\}$, number of iterations $T$
while $t < T$ do
    for $i = 1, \ldots, N$ do
        Compute the loss-augmented cost map $\tilde{c}_i = w^\top F_i - l_i^\top$
        Compute the optimal trajectory $\tau_i^* = \arg\min_{\tau} \tilde{c}_i \mu$
        Compute the state-action frequency counts $\mu_i^*$
    end for
    Compute the subgradient $g \in \partial L_{\text{MMP}}(w)$
    $w \leftarrow w - \alpha_t g$
    (Optional) Project $w$ onto any additional constraints
    $t \leftarrow t + 1$
end while
return $w$


4.4.3 Inverse Reinforcement Learning Based on the Maximum Entropy Principle


In recent studies on IRL, the maximum entropy principle [Jaynes, 1957]
is often used to obtain a unique reward function. In the following
sections, we review IRL methods based on the maximum entropy principle.


4.4.3.1 Maximum Entropy Inverse Reinforcement Learning 


As described in §4.1, the IRL problem is ill-posed because a policy can 
be optimal for multiple reward functions. The max-margin approach 
described in the previous section works well when there is a single 
reward function that is clearly better than alternatives. However, in 
other cases optimizing a distribution over behaviors may be preferable. 

The maximum entropy principle [Jaynes, 1957] suggests choosing, among
the distributions that match the feature expectations of the
demonstrator, the one that maximizes the entropy [Dudik and Schapire,
2006, Ziebart et al., 2008]. Following this principle, Ziebart et al.
[2008] proposed to learn a policy that maximizes the entropy

$$H(p(\tau)) = \sum_{\tau} p(\tau) \ln \frac{1}{p(\tau)} \tag{4.11}$$

subject to the constraints

$$\mathbb{E}_{\pi^L}[\phi(\tau)] = \mathbb{E}_{\pi^E}[\phi(\tau)], \tag{4.12}$$

$$\sum_{\tau} p(\tau) = 1, \quad \forall \tau, \; p(\tau) > 0, \tag{4.13}$$

where $\mathbb{E}_{\pi^L}[\phi(\tau)]$ is the expected feature count
with respect to the learner’s policy and
$\mathbb{E}_{\pi^E}[\phi(\tau)]$ is the expected feature count with
respect to the expert’s policy.

Among the distributions that satisfy
$\mathbb{E}_{\pi^L}[\phi(\tau)] = \mathbb{E}_{\pi^E}[\phi(\tau)]$, the
maximum entropy distribution follows

$$p(\tau) \propto \exp(R(\tau)), \tag{4.14}$$

where $p(\tau)$ is the probability of the trajectory $\tau$, and
$R(\tau) = w^\top \phi(\tau)$ is the reward of $\tau$. The parameter
vector $w$ is the Lagrange multiplier for the feature matching
constraint. Hence, due to the feature matching constraint, the reward
function is linear in the trajectory features. The probability of the
trajectory can therefore be expressed as

$$p(\tau \mid w) = \frac{1}{Z(w)} \exp\left(w^\top \phi(\tau)\right), \tag{4.15}$$

where $Z(w)$ is the partition function given by
$Z(w) = \sum_{\tau} \exp\left(w^\top \phi(\tau)\right)$.

However, Equation 4.15 only holds for deterministic environments.
For stochastic environments, the trajectory distribution is also
affected by the transition probabilities, i.e.,

$$p(\tau \mid w) = \frac{\exp\left(w^\top \phi(\tau)\right)}{Z(w)} \prod_{x_{t+1}, u_t, x_t \in \tau} p(x_{t+1} \mid u_t, x_t). \tag{4.16}$$

The implication of this observation is that the agent is now trying to
optimize

$$\tilde{R}(\tau) = w^\top \phi(\tau) + \sum_{t} \log p(x_{t+1} \mid u_t, x_t),$$

where we have a bias term due to the stochasticity of the environment.
This is one of the main theoretical drawbacks of maximum entropy
IRL, which is addressed by follow-up work such as the maximum causal
entropy IRL [Ziebart, 2010].

The parameter $w$ of the reward can be obtained by maximizing the
likelihood of the observed data under the maximum entropy distribution
as

$$w^* = \arg\max_{w} L_{\text{ME}}(w) = \arg\max_{w} \sum_{\tau^{\text{demo}}} \ln p(\tau^{\text{demo}} \mid w). \tag{4.17}$$

Since maximizing the likelihood is equivalent to the M-projection, this
problem formulation can be interpreted as M-projection onto the
manifold of the maximum entropy distribution, which we discussed in
§2.7.1. Since the objective function $L_{\text{ME}}(w)$ is concave in
$w$ (equivalently, the negative log-likelihood is convex), this
optimization can be solved using gradient-based methods. The gradient
is given by the difference between the empirical feature counts from
demonstrations and the expected feature counts from the learner’s
policy as

$$\nabla L_{\text{ME}}(w) = \mathbb{E}_{\pi^E}[\phi(\tau)] - \sum_{\tau} p(\tau \mid w) \phi(\tau) = \mathbb{E}_{\pi^E}[\phi(\tau)] - \sum_{x_i} D_{x_i} \phi(x_i). \tag{4.18}$$

If $\phi(\tau) = \sum_{t} \phi(x_t)$, then the expectation over the
state features $\phi(x)$ can be computed by estimating the expected
state visitation frequencies $D_{x_i}$ of the current reward model, at
least in discrete domains. For computing these frequencies, a
backward-forward message passing algorithm can be used. Algorithm 17
summarizes the procedure for computing the state visitation
frequencies.

Although the maximum entropy IRL proposed by Ziebart et al. [2008],
Rust [1994] works well in MDP problems, it assumes that the state
transition distribution is known, which is not the case in many
robotic applications. Sampling-based or model learning extensions must
be applied to problems where the model is not specified.


Algorithm 17 Expected edge frequency calculation [Ziebart et al., 2008]

Backward pass
    Set $Z_{x_{\text{terminal}}} = 1$
    Recursively compute for $N$ iterations
        $Z_{u_{i,j}} = \sum_{k} p(x_k \mid x_i, u_{i,j}) \exp(R(x_i \mid w)) Z_{x_k}$
        $Z_{x_i} = \sum_{u_{i,j}} Z_{u_{i,j}}$
Local action probability computation
    $p(u_{i,j} \mid x_i) = Z_{u_{i,j}} / Z_{x_i}$
Forward pass
    Set $D_{x_i, t} = p(x_i = x_{\text{initial}})$
    Recursively compute for $t = 1$ to $N$
        $D_{x_k, t+1} = \sum_{x_i} \sum_{u_{i,j}} D_{x_i, t} \, p(u_{i,j} \mid x_i) \, p(x_k \mid x_i, u_{i,j})$
Summing frequencies
    $D_{x_i} = \sum_{t} D_{x_i, t}$
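
The backward-forward computation of Algorithm 17 can be sketched in
plain Python for a small tabular MDP. The sketch below makes several
assumptions that are not stated in the text: the transition model is a
nested list `P[i][j][k]`, the terminal state has no outgoing
transitions, and a value of 1 is re-added to the terminal state in
every backward sweep so that its partition value stays at 1. The toy
chain MDP and all function names are hypothetical.

```python
import math


def expected_state_visitation(P, reward, terminal, p_init, n_iters):
    """P[i][j][k] = p(x_k | x_i, u_j); reward[i] = R(x_i | w)."""
    n_states, n_actions = len(P), len(P[0])
    indicator = [1.0 if i == terminal else 0.0 for i in range(n_states)]

    # Backward pass: propagate partition values from the terminal state.
    z_state = indicator[:]
    z_action = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_iters):
        for i in range(n_states):
            for j in range(n_actions):
                z_action[i][j] = math.exp(reward[i]) * sum(
                    P[i][j][k] * z_state[k] for k in range(n_states))
        # Assumption of this sketch: keep the terminal value at 1.
        z_state = [sum(z_action[i]) + indicator[i] for i in range(n_states)]

    # Local action probabilities p(u_j | x_i) = Z_{u_ij} / Z_{x_i}.
    policy = [[z_action[i][j] / z_state[i] if z_state[i] > 0.0 else 0.0
               for j in range(n_actions)] for i in range(n_states)]

    # Forward pass: push the initial distribution through the policy.
    d_t = list(p_init)
    visitation = list(p_init)
    for _ in range(n_iters):
        d_next = [0.0] * n_states
        for i in range(n_states):
            for j in range(n_actions):
                for k in range(n_states):
                    d_next[k] += d_t[i] * policy[i][j] * P[i][j][k]
        d_t = d_next
        visitation = [v + d for v, d in zip(visitation, d_t)]
    return policy, visitation


def maxent_gradient(mu_expert, visitation, phi):
    """Gradient of Eq. (4.18): expert features minus expected features."""
    k = len(mu_expert)
    expected = [sum(visitation[i] * phi(i)[d] for i in range(len(visitation)))
                for d in range(k)]
    return [mu_expert[d] - expected[d] for d in range(k)]


# Toy chain 0 -> 1 -> 2 (terminal); action 0 stays, action 1 advances.
P = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
     [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
     [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]
policy, D = expected_state_visitation(P, [-1.0, -1.0, 0.0], 2,
                                      [1.0, 0.0, 0.0], 60)
```

With a reward of -1 per non-terminal step, the resulting stochastic
policy prefers advancing over staying, and the visitation count of the
start state exceeds one because the policy occasionally stays in place.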


4.4.3.2 Maximum Causal Entropy Inverse Reinforcement Learning


To address the theoretical drawback of maximum entropy IRL under
stochastic dynamics, Ziebart [2010] proposed to use the maximum
causal entropy for IRL. The key idea of causal entropy is that action
choices need to be causal, i.e., the action selected at time step $t$
must not depend on future states in the trajectory. Based on this
insight, an algorithm can be developed that also accounts for the
stochasticity of the dynamics in the reward estimation. In contrast to
maximum entropy IRL, maximum causal entropy IRL removes the “bonus
entropy” that is due to the stochastic dynamics of the environment
itself. This prevents learning policies that simply steer toward
highly stochastic regions of the state space.

Maximum causal entropy IRL [Ziebart, 2010] seeks the policy
$\pi^L(u \mid x)$ that maximizes the causal entropy
$H(u_{1:T} \| x_{1:T})$ of the actions given the states, i.e.,

$$\pi^L(u \mid x) = \arg\max_{\pi^L(u \mid x)} H(u_{1:T} \| x_{1:T}) \tag{4.19}$$

subject to the constraint of feature expectation matching

$$\mathbb{E}_{\pi^L}[\phi(\tau)] = \mathbb{E}_{\pi^E}[\phi(\tau^{\text{demo}})], \quad \sum_{u} \pi^L(u \mid x) = 1, \quad \pi^L(u \mid x) > 0, \tag{4.20}$$

where the feature function $\phi(\tau) = \sum_{t} \phi(x_t, u_t)$ is
given by the sum over state-action features. The causal entropy is
defined as

$$H(u_{1:T} \| x_{1:T}) = \sum_{t=1}^{T} H(u_t \mid u_{1:t-1}, x_{1:t}) = -\sum_{t=1}^{T} \sum_{u_{1:t}, x_{1:t}} p(u_{1:t}, x_{1:t}) \ln \pi^L(u_t \mid x_{1:t}, u_{1:t-1}), \tag{4.21}$$

where $H(u_t \mid u_{1:t-1}, x_{1:t})$ is the conditional entropy and
$p(u_{1:t}, x_{1:t})$ is the joint distribution over all states and
actions until time step $t$. In contrast to the conditional entropy
$H(u_{1:T} \mid x_{1:T})$, which is implicitly used in standard maximum
entropy IRL, the causal entropy $H(u_{1:T} \| x_{1:T})$ conditions the
action choice at time step $t$ only on states until time step $t$,
while the conditional entropy would make the action choice also
dependent on future states (i.e., it ignores causality).

Under the assumption that the system is Markovian,
$p(x_t \mid x_{1:t-1}, u_{1:t-1})$ reduces to
$p(x_t \mid x_{t-1}, u_{t-1})$, and $\pi(u_t \mid x_{1:t}, u_{1:t-1})$
reduces to $\pi(u_t \mid x_t)$. Causal entropy can be maximized using
dynamic programming [Ziebart, 2010], resulting in equations similar to
those found in soft value-iteration methods.
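
To make the connection to soft value iteration concrete, the following
Python sketch runs a generic finite-horizon soft Bellman backup on a
tabular MDP: the hard maximum over actions is replaced by a
log-sum-exp, and the policy is the corresponding softmax. It is a
sketch of the general recursion under assumed inputs (a reward table
`r[x][u]`, for instance $w^\top \phi(x, u)$, and a transition table
`P[x][u][y]`), not a reproduction of the exact algorithm in Ziebart
[2010].

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def soft_value_iteration(P, r, horizon):
    """P[x][u][y] = p(y | x, u); r[x][u] is the state-action reward."""
    n_states, n_actions = len(r), len(r[0])
    V = [0.0] * n_states          # soft value beyond the final step
    policies = []
    for _ in range(horizon):      # sweep backward from the final step
        Q = [[r[x][u] + sum(P[x][u][y] * V[y] for y in range(n_states))
              for u in range(n_actions)] for x in range(n_states)]
        V = [logsumexp(Q[x]) for x in range(n_states)]
        # Softmax policy: pi(u | x) = exp(Q(x, u) - V(x)).
        policies.append([[math.exp(Q[x][u] - V[x]) for u in range(n_actions)]
                         for x in range(n_states)])
    policies.reverse()            # policies[t] is the policy at time step t
    return policies, V


# Hypothetical two-state, two-action problem.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
r = [[0.0, math.log(3.0)],
     [0.0, 0.0]]
policies_1, V_1 = soft_value_iteration(P, r, horizon=1)
policies_3, V_3 = soft_value_iteration(P, r, horizon=3)
```

Because the expectation over next states is taken inside the backup
and only the action choice is "softened", the stochasticity of the
dynamics does not contribute entropy to the policy.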


4.4.3.3 IRL from Failed Demonstrations 


Although the usual aim of inverse reinforcement learning is to learn an
optimal policy from demonstrated successful trajectories, failed
demonstrations also contain information that can be used for learning.
Shiarlis et al. [2016] extend the maximum causal entropy IRL method
[Ziebart, 2010] to learning from failed demonstrations. When using the
maximum entropy approach for learning from successful demonstrations,
the learned feature expectations should be similar to the demonstrated
ones. In order to take failed demonstrations into account, Shiarlis et
al. [2016] modify the maximum causal entropy IRL [Ziebart, 2010]
optimization problem so that the optimized policy favors trajectories
whose features are dissimilar to the features found in failed
demonstrations:

$$\max_{\pi^L(u \mid x), w, z} \; H(u_{1:T} \| x_{1:T}) + \sum_{k=1}^{K} w_k z_k - \frac{\lambda}{2}\|w\|^2 \tag{4.22}$$

subject to

$$\mathbb{E}_{\pi^L(u \mid x)}[\phi(\tau_S)] = \mathbb{E}_{\pi^E}[\phi(\tau_S^{\text{demo}})],$$

$$\mathbb{E}_{\pi^L(u \mid x)}[\phi_k(\tau_F)] - \mathbb{E}_{\pi^E}[\phi_k(\tau_F^{\text{demo}})] = z_k,$$

$$\sum_{u} \pi^L(u \mid x) = 1, \quad \pi^L(u \mid x) > 0,$$

where $\lambda$ is a constant, $K$ is the number of features, and $w$
are the feature weights to optimize. While the original maximum causal
entropy approach uses only the features of successful demonstrations
$\phi(\tau_S^{\text{demo}})$, the approach of Shiarlis et al. [2016]
also uses the features of failed demonstrations $\phi(\tau_F)$. The
term $\sum_{k=1}^{K} w_k z_k$ favors large distances between the
features generated by the policy and the features in failed
demonstrations, and $\frac{\lambda}{2}\|w\|^2$ is a regularization
term that keeps $w$ small. To find a solution to the program in
Equation 4.22, Shiarlis et al. [2016] perform gradient ascent on the
feature weights while incrementally decreasing $\lambda$ until it
reaches a threshold. The idea of this procedure is to first emphasize
finding good weights for successful demonstrations and then focus on
finding weights for failed demonstrations.


4.4.3.4 Connection of Maximum Entropy Methods to Economics


For discrete MDPs, the Boltzmann policy form and closely-related
dynamic programs have been developed in the econometrics community
under the rubric of “structural estimation” from a completely different
analysis. Notably, Rust [1994] derived predictive distributions of
agents’ actions by developing a framework for learning cost functions
and predictive stochastic policies for agents acting according to a
Markov Decision Process. Intriguingly, the MaxEnt policy structure and
the dynamic programming algorithms derived from the maximum entropy
formulation arise as well by considering an economist with only partial
access to the prediction problem and including random “shocks” in a
model of what would otherwise be optimal behavior. These close
connections between operations research (“structural estimation”),
control theory (“inverse optimal control”) and machine learning
(“inverse reinforcement learning”) deserve much deeper investigation
and better cross-fertilization between communities.


4.4.4 Miscellaneous Important Model-Based IRL Methods 


Although the maximum entropy principle is becoming dominant in 
recent studies on IRL, various other model-based IRL methods have 
been proposed. We review some of them in the following sections. 


4.4.4.1 Linearly-Solvable MDPs 


The linearly-solvable MDP approach of Dvijotham and Todorov [2010]
differs from standard inverse reinforcement learning approaches in
that it estimates a value function instead of a reward or cost
function. A reward function can be used to optimize a policy under
different system dynamics, but a value function may require system
dynamics similar to those used for learning the value function.

The linearly-solvable MDP approach of Dvijotham and Todorov [2010] is
designed to avoid solving an MDP repeatedly. Dvijotham and Todorov
[2010] assume a special kind of linearly-solvable MDP in which the
system dynamics are divided into passive dynamics and policy-specific
active dynamics. The cost function is a combination of a
state-specific cost $c(x_t)$ and a cost on the difference between the
passive dynamics $p(x_{t+1} \mid x_t)$ and the policy-specific
dynamics $\pi(x_{t+1} \mid x_t)$:

$$c(x_t, \pi) = c(x_t) + D_{\text{KL}}\left(\pi(x_{t+1} \mid x_t) \,\|\, p(x_{t+1} \mid x_t)\right). \tag{4.23}$$

While the maximum entropy approach of Ziebart et al. [2008] prefers
exponentially larger rewards, Dvijotham and Todorov [2010] prefer
exponentially larger value functions of the next state, which is
influenced by the policy $\pi$:

$$\pi(x_{t+1} \mid x_t) = \frac{p(x_{t+1} \mid x_t) \, z(x_{t+1})}{Z}, \tag{4.24}$$

where $z(x_{t+1}) = \exp(V(x_{t+1}))$ is the desirability function,
$Z$ is for normalization, and $V(x_{t+1})$ is the value function. Note
that the policy $\pi(x_{t+1} \mid x_t)$ is a scaled version of the
passive transition probabilities $p(x_{t+1} \mid x_t)$. The IRL problem
is then to estimate the value function from state transition samples.
Dvijotham and Todorov [2010] find the maximum likelihood value
function by solving an unconstrained convex optimization problem. The
advantage of the approach is that it does not require solving the MDP
repeatedly. A disadvantage is that, in continuous state spaces,
Dvijotham and Todorov [2010] need to approximate the value function,
which may be more challenging than approximating the reward function,
the common approach in IRL. Moreover, a learned reward function can be
used under different dynamics, while this can be challenging for a
value function that has been optimized for specific application
dynamics.
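
Equations (4.23) and (4.24) can be illustrated on a discrete state
space with a short Python sketch. It follows the definition
$z = \exp(V)$ used in the text; the passive dynamics matrix, the value
vector, and the function names are assumptions made for this example.

```python
import math


def lmdp_policy(passive, z):
    """Eq. (4.24): pi(x' | x) = p(x' | x) z(x') / Z, row by row."""
    policy = []
    for row in passive:
        unnormalized = [p * zy for p, zy in zip(row, z)]
        Z = sum(unnormalized)           # normalization constant for this state
        policy.append([v / Z for v in unnormalized])
    return policy


def kl_divergence(q, p):
    """KL(q || p) for discrete distributions; terms with q = 0 contribute 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


# Hypothetical passive dynamics over two states and a value function V.
passive = [[0.5, 0.5],
           [0.5, 0.5]]
V = [0.0, math.log(3.0)]
z = [math.exp(v) for v in V]            # desirability z(x) = exp(V(x))

pi = lmdp_policy(passive, z)
control_cost = kl_divergence(pi[0], passive[0])  # second term of Eq. (4.23)
```

The policy reweights the passive dynamics toward states with larger
desirability, and the KL term measures how much this reweighting
deviates from the passive behavior.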


4.4.4.2 IRL Methods Based on a Bayesian Framework 


The Bayesian framework is a powerful tool in machine learning that
allows updating the current hypothesis based on new evidence.
Ramachandran and Amir [2007] proposed an IRL method based on the
Bayesian framework. In this framework, the action of the expert is
considered as evidence that can be used to update a prior on reward
functions. As in [Ziebart et al., 2008], a (different) log-linear
distribution is assumed, and the posterior probability of the reward
function can be computed using Bayes’ theorem as

$$p(R \mid \tau) = \frac{1}{Z} \exp\left(\alpha E(\tau, R)\right) p(R), \tag{4.25}$$

which can be considered as a Boltzmann-type distribution with energy
$E(\tau, R)$. Recovering the reward function and learning the optimal
policy from demonstrations requires computing the mean of this
posterior distribution. In the study by Ramachandran and Amir [2007],
an MCMC algorithm was used to generate samples from the posterior
distribution, and the sample mean was used as an estimate of the mean
of the true distribution.

Instead of computing the posterior mean, Choi and Kim [2011b]
proposed to use maximum a posteriori (MAP) inference. The IRL problem
with MAP inference can be formulated as finding the reward function
$R_{\text{MAP}}$ that maximizes the posterior

$$R_{\text{MAP}} = \arg\max_{R} p(R \mid \mathcal{D}) = \arg\max_{R} \left[\ln p(\mathcal{D} \mid R) + \ln p(R)\right], \tag{4.26}$$

where $\mathcal{D} = \{(x_t, u_t)\}$ is a set of state-action pairs
demonstrated by the expert. The likelihood $p(\mathcal{D} \mid R)$ can
be interpreted as a measure of the compatibility of the reward
function $R$ with the demonstrated behavior data $\mathcal{D}$. To
solve this problem, Choi and Kim [2011b] used gradient-based
optimization. Choi and Kim [2011b] suggested that MMP, maximum entropy
IRL, and other IRL methods can be interpreted in a Bayesian framework.


4.4.5 Learning Nonlinear Reward Functions 


While research on inverse reinforcement learning originally focused
mostly on learning reward functions that are linear with respect to
the feature vectors [Abbeel and Ng, 2004, Ziebart et al., 2008,
Ratliff et al., 2006b, Boularias et al., 2011], many tasks, for
example in robotics, require nonlinear reward functions [Silver et
al., 2010, Ratliff et al., 2006a, Grubb and Bagnell, 2010, Levine et
al., 2011, Finn et al., 2016b]. Below, we discuss model-based
approaches for modeling such nonlinear rewards.


4.4.5.1 Boosting Methods 


The earliest approaches to learning rich reward functions from model
classes with high representational power used gradient boosting.
These methods, typified by Ratliff et al. [2006a], Silver et al.
[2010], and Ratliff et al. [2009], can use arbitrary supervised
learning algorithms in an ensemble to create highly nonlinear cost
functions. This approach has been used to learn locomotion strategies
from demonstration [Zucker et al., 2011] as well as to learn to match
real-world, rough-terrain driving strategies [Silver et al., 2010,
2016, 2013]. These are among the easiest and most general approaches
to implement, and an example of their use is discussed in §4.8.2.


4.4.5.2 Deep Network Methods


Deep neural approaches to complex IRL cost functions were first
demonstrated in Grubb and Bagnell [2010] and Bradley [2010]. Both
approaches build on the maximum margin formalism (although they apply
equally to related ones such as maximum entropy) and use variants of
backpropagation to learn sophisticated cost functions from
demonstrations for interpreting sensor data.


4.4.5.3 Gaussian Process IRL 


To learn a nonlinear reward function, Levine et al. [2011] use a
Gaussian process (GP) approach based on the maximum entropy principle
[Ziebart et al., 2008]. The original maximum entropy approach
[Ziebart et al., 2008] uses a reward function that is linear in the
features. Levine et al. [2011] use GP inverse reinforcement learning
(GPIRL) to represent a reward function that is nonlinear in the
features. In general, a GP [Rasmussen and Williams, 2006] defines a
probability distribution over possible outputs given some input
coordinates, and kernel hyperparameters define the actual shape of the
GP. In GPIRL, the kernel hyperparameters $\theta$ define the shape of
the reward function, manually chosen feature coordinates $\phi_u$
correspond to input coordinates, outputs correspond to demonstrated
actions, and a GP models the probability distribution over true
actions $u$. The probability distribution over $u$ and $\theta$ is

$$p(u, \theta \mid \mathcal{D}, \phi_u) \propto \left[\int p(\mathcal{D} \mid r) \, p(r \mid u, \theta, \phi_u) \, dr\right] p(u, \theta \mid \phi_u), \tag{4.27}$$

where $p(\mathcal{D} \mid r)$ is the distribution over demonstrated
trajectories, which is given by the maximum entropy principle, under
which trajectories with larger rewards are exponentially more likely;
$p(r \mid u, \theta, \phi_u)$ is the conditional GP posterior reward
probability; and $p(u, \theta \mid \phi_u)$ is the prior GP
probability for $u$ and $\theta$. In order to compute Equation 4.27,
Levine et al. [2011] use several approximations. The choice of
$\phi_u$ is particularly important since it has a large impact both on
whether the solution covers the true reward function and on the
computational requirements: GPs are computationally intensive because
of the required covariance matrix inversion, where the size of the
matrix depends on the size of the input space.


4.4.6 Guided Cost Learning 


Recently, Finn et al. [2016b] extended the nonlinear neural cost
function approach described above [Grubb and Bagnell, 2010], using an
adaptive sampling scheme rather than an analytic approximation as the
policy optimization step in an unknown Markov decision process. To
handle the non-uniqueness of the cost function as well as imperfect
demonstrations, Finn et al. [2016b] use the maximum entropy principle
[Ziebart et al., 2008]. To optimize a policy and learn the cost
function, the approach of Finn et al. [2016b] alternates between two
steps: 1) update the cost function based on samples from both the
policy and the demonstrations, and 2) update the policy based on the
new cost function.

Guided cost learning finds the maximum likelihood solution under the
maximum entropy principle, as in [Ziebart et al., 2008]. Under the
maximum entropy assumption, the probability distribution of the
trajectory $\tau$ is given by
$p(\tau) = \frac{1}{Z} \exp(-c_w(\tau))$, where $c_w$ is the cost
function parameterized with a vector $w$. The objective function
$L_{\text{GCL}}$ of guided cost learning is given by the negative
log-likelihood of the maximum entropy distribution

$$L_{\text{GCL}} = \frac{1}{N} \sum_{\tau_i \in \mathcal{D}_{\text{demo}}} c_w(\tau_i) + \ln Z \tag{4.28}$$

$$\approx \frac{1}{N} \sum_{\tau_i \in \mathcal{D}_{\text{demo}}} c_w(\tau_i) + \ln \frac{1}{M} \sum_{\tau_j \in \mathcal{D}_{\text{samp}}} \frac{\exp(-c_w(\tau_j))}{q(\tau_j)}, \tag{4.29}$$

where $\mathcal{D}_{\text{demo}}$ is the set of $N$ demonstrated
trajectories, $\mathcal{D}_{\text{samp}}$ is the set of $M$ samples,
and $q$ is the distribution from which $\tau_j$ is sampled. The
gradient of the cost $c_w$ with respect to the parameter can be
efficiently computed when the cost is represented by a neural network.
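
The importance-sampled objective in Eq. (4.29) can be evaluated with a
few lines of Python once the trajectory costs and the sampling
densities are available. The sketch below assumes that the costs
$c_w(\tau)$ and the log-densities $\ln q(\tau_j)$ have already been
computed elsewhere; the function name and the numbers are hypothetical.

```python
import math


def gcl_loss(demo_costs, sample_costs, sample_log_q):
    """Importance-sampled negative log-likelihood of Eq. (4.29).

    demo_costs:   c_w(tau_i) for the N demonstrated trajectories
    sample_costs: c_w(tau_j) for the M sampled trajectories
    sample_log_q: ln q(tau_j) for the same M samples
    """
    n, m = len(demo_costs), len(sample_costs)
    # Log importance weights: -c_w(tau_j) - ln q(tau_j).
    log_weights = [-c - lq for c, lq in zip(sample_costs, sample_log_q)]
    # Stable evaluation of ln((1/M) * sum_j exp(log_weight_j)).
    mx = max(log_weights)
    log_z = mx + math.log(sum(math.exp(v - mx) for v in log_weights) / m)
    return sum(demo_costs) / n + log_z


# Hypothetical values: two demonstrations and two samples drawn with q = 0.5.
loss = gcl_loss(demo_costs=[1.0, 3.0],
                sample_costs=[0.0, 0.0],
                sample_log_q=[math.log(0.5), math.log(0.5)])
```

Working with log-weights avoids overflow when costs are large in
magnitude; in practice the gradient of this quantity with respect to
$w$ would be obtained by automatic differentiation through the cost
network.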

Algorithm 18 summarizes the approach. In more detail, at each
iteration, Finn et al. [2016b] sample additional trajectories using
the current policy and a black box simulator. Next, the cost function
is updated based on all sampled trajectories and the demonstrations.
The parameters of the neural network representing the cost function
are updated based on the gradient computed using the exponential cost
typical for maximum entropy based approaches. For updating the policy
based on the new cost function and samples, Finn et al. [2016b] use a
constrained version of linear quadratic regulator (LQR) based
trajectory optimization together with local linear-Gaussian
approximations of the dynamics estimated from the samples [Levine and
Abbeel, 2014].


Algorithm 18 Guided cost learning [Finn et al., 2016b]

Initialize $q_k(\tau)$ as either a random initial controller or from demonstrations
for iteration $i = 1$ to $I$ do
    Generate samples $\mathcal{D}_{\text{traj}}$ from $q_k(\tau)$
    Append samples: $\mathcal{D}_{\text{samp}} \leftarrow \mathcal{D}_{\text{samp}} \cup \mathcal{D}_{\text{traj}}$
    Use $\mathcal{D}_{\text{samp}}$ to update the cost $c_w$ using Algorithm 19
    Update $q_k(\tau)$ using $\mathcal{D}_{\text{traj}}$ and the method from Levine and Abbeel [2014] to obtain $q_{k+1}(\tau)$
end for
return optimized cost parameters $w$ and trajectory distribution $q(\tau)$

The approach of Finn et al. [2016b] has several interesting proper- 
ties. Firstly, the policy optimization part of the approach is designed for 
smooth continuous trajectories found e.g. in robotics. Secondly, the ap- 
proach requires a black box simulator but no explicit dynamics model. 

Recently, Finn et al. [2016a] and Ho and Ermon [2016] identified the
close connection between inverse reinforcement learning and the more
recent generative adversarial networks [Goodfellow et al., 2014]. In
generative adversarial networks, a generative model $G$ is trained to
generate data samples so as to mimic the true data distribution, while
the discriminator $D$ is trained to discriminate between the data
generated by $G$ and the true data. These works demonstrate that
optimization/RL plays the role of a generator while the learned cost
function plays the role of a discriminator, albeit with the
generalization of applying to any trajectory a system could take. This
viewpoint sheds light on the instabilities of GANs and the potential
power of combining algorithms used in each field.


Algorithm 19 Nonlinear IOC with stochastic gradients [Finn et al., 2016b]

for iteration $k = 1$ to $K$ do
    Sample demonstration batch $\hat{\mathcal{D}}_{\text{demo}} \subset \mathcal{D}_{\text{demo}}$
    Sample background batch $\hat{\mathcal{D}}_{\text{samp}} \subset \mathcal{D}_{\text{samp}}$
    Append demonstration batch to background batch: $\hat{\mathcal{D}}_{\text{samp}} \leftarrow \hat{\mathcal{D}}_{\text{demo}} \cup \hat{\mathcal{D}}_{\text{samp}}$
    Estimate $\frac{dL_{\text{GCL}}}{dw}(w)$ using $\hat{\mathcal{D}}_{\text{demo}}$ and $\hat{\mathcal{D}}_{\text{samp}}$
    Update parameters $w$ using gradient $\frac{dL_{\text{GCL}}}{dw}(w)$
end for
return optimized cost parameters $w$


4.5 Model-Free Inverse Reinforcement Learning Methods


In robotics and other application fields, exact dynamics models are
often difficult to come by. Model-free IRL methods sidestep the problem
by not requiring such prior knowledge. Model-free IRL methods often
employ sampling-based approaches to estimate the trajectory
distribution. Although this approach requires many samples of
trajectories in the learning process, it avoids the explicit learning
of system dynamics.


4.5.1 Relative Entropy Inverse Reinforcement Learning 


Although model-based IRL methods assume that the system dynamics, e.g.
the state transition probability, are known, model-free IRL methods do
not require such prior knowledge of the system dynamics. Relative
entropy IRL [Boularias et al., 2011] is one such model-free IRL
method. Boularias et al. [2011] proposed to minimize the relative
entropy between a prior trajectory distribution $q_0(\tau)$ induced by
a baseline policy and the trajectory distribution $p(\tau)$ induced by
the learner’s policy. To minimize the relative entropy without prior
knowledge of the system dynamics, Boularias et al. [2011] use
importance sampling to estimate the expected feature count. Relative
entropy IRL also assumes that the reward is given as a linear function
of the feature vector, $R(\tau) = w^\top \phi(\tau)$. The problem can
be formulated as minimizing the relative entropy

$$\min_{p} \sum_{\tau \in \mathcal{T}} p(\tau) \ln \frac{p(\tau)}{q_0(\tau)} \tag{4.30}$$

subject to the constraints

$$\forall i \in \{1, \ldots, k\}, \quad \left|\mathbb{E}_{\pi^L}[\phi_i(\tau)] - \mathbb{E}_{\pi^E}[\phi_i(\tau)]\right| \leq \epsilon_i, \tag{4.31}$$

$$\sum_{\tau \in \mathcal{T}} p(\tau) = 1, \tag{4.32}$$

$$\forall \tau \in \mathcal{T}, \quad p(\tau) \geq 0, \tag{4.33}$$

where $\mathbb{E}_{\pi^E}[\phi_i(\tau)]$ is the empirical expectation
of the $i$th feature calculated from demonstrations,
$\mathbb{E}_{\pi^L}[\phi_i(\tau)] = \sum_{\tau} p(\tau) \phi_i(\tau)$
is the expectation of the $i$th feature with respect to the learner’s
policy, $k$ is the number of features, $\mathcal{T}$ is a set of
feasible trajectories, and the threshold $\epsilon_i$ is calculated by
using Hoeffding’s bound. The Lagrangian of this problem is given by

$$L_{\text{RE}}(p, w, \eta) = \sum_{\tau \in \mathcal{T}} p(\tau) \ln \frac{p(\tau)}{q_0(\tau)} - w^\top \left(\sum_{\tau \in \mathcal{T}} p(\tau) \phi(\tau) - \mathbb{E}_{\pi^E}[\phi(\tau)]\right) - \sum_{i=1}^{k} |w_i| \epsilon_i + \eta \left(\sum_{\tau \in \mathcal{T}} p(\tau) - 1\right). \tag{4.34}$$

The dual problem is given by maximizing the dual function

$$g_{\text{RE}}(w) = w^\top \mathbb{E}_{\pi^E}[\phi(\tau)] - \ln Z(w) - \sum_{i=1}^{k} |w_i| \epsilon_i, \tag{4.35}$$

where $Z(w) = \sum_{\tau \in \mathcal{T}} q_0(\tau) \exp\left(w^\top \phi(\tau)\right)$.
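
A subgradient of the dual function in Eq. (4.35) can be estimated from
sampled trajectories, as in the following Python sketch. It makes a
simplifying assumption that is not part of the text: the samples are
drawn directly from the baseline distribution $q_0$, so that the
self-normalized importance weights are proportional to
$\exp(w^\top \phi(\tau))$. With a different sampling policy, an
additional ratio of policy probabilities would be needed. All names and
numbers are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def importance_weights(w, sample_features):
    """Self-normalized weights proportional to exp(w^T phi(tau))."""
    scores = [dot(w, f) for f in sample_features]
    m = max(scores)                       # subtract the max for stability
    unnormalized = [math.exp(s - m) for s in scores]
    total = sum(unnormalized)
    return [v / total for v in unnormalized]


def dual_subgradient(w, mu_expert, sample_features, eps):
    """Subgradient of g_RE(w): mu_E - E_p[phi] - sign(w) * eps."""
    weights = importance_weights(w, sample_features)
    k = len(w)
    mu_learner = [sum(wt * f[d] for wt, f in zip(weights, sample_features))
                  for d in range(k)]
    sign = [(v > 0) - (v < 0) for v in w]
    return [mu_expert[d] - mu_learner[d] - sign[d] * eps[d] for d in range(k)]


# Two sampled trajectories with one-hot features; the expert matches the first.
samples = [[1.0, 0.0], [0.0, 1.0]]
mu_E = [1.0, 0.0]
w0 = [0.0, 0.0]
grad = dual_subgradient(w0, mu_E, samples, eps=[0.0, 0.0])
w1 = [wi + 1.0 * gi for wi, gi in zip(w0, grad)]   # one ascent step
weights_after = importance_weights(w1, samples)
```

After one ascent step, the reweighted samples already place more
probability on the trajectory whose features resemble the expert's.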


Boularias et al. [2011] solve this dual problem with a
subgradient-based method, using importance sampling to estimate the
required expectations. Since the expected feature count is estimated
through sampling, this method can be applied to a system with unknown
dynamics.


Algorithm 20 Generative adversarial imitation learning

Input: Expert trajectories $\mathcal{D} = \{\tau_i\}_{i=1}^{N}$, initial policy and discriminator parameters $\theta_0$, $w_0$
for iteration $i = 1$ to $K$ do
    Sample trajectories $\tau_i \sim \pi^L_{\theta_i}$
    Update the discriminator parameters from $w_i$ to $w_{i+1}$ with the gradient
        $\hat{\mathbb{E}}_{\tau_i}[\nabla_w \ln(D_w(x, u))] + \hat{\mathbb{E}}_{\tau_E}[\nabla_w \ln(1 - D_w(x, u))]$
    Update the policy $\pi^L_\theta$ using the TRPO rule with the cost function $\ln(D_{w_{i+1}}(x, u))$, which takes a KL-constrained natural gradient step with
        $\hat{\mathbb{E}}_{\tau_i}[\nabla_\theta \ln \pi^L_\theta(u \mid x) Q(x, u)] - \lambda \nabla_\theta H(\pi^L_\theta)$,
        where $Q(\bar{x}, \bar{u}) = \hat{\mathbb{E}}_{\tau_i}[\ln(D_{w_{i+1}}(x, u)) \mid x_0 = \bar{x}, u_0 = \bar{u}]$
end for
return optimized policy parameters $\theta$


4.5.2 Generative Adversarial Imitation Learning 


Recently, Ho and Ermon [2016] proposed generative adversarial imita- 
tion learning (GAIL) by leveraging the connection noted above between 
GANs [Goodfellow et al., 2014] and IRL. t This viewpoint enables con- 
straining the behavior of the agent to be approximately optimal ac- 
cording to an unknown reward function without explicitly attempting 
to recover that reward function. 

Ho and Ermon [2016] trained a policy that reproduces the expert’s 
behavior and a discriminator that distinguishes trajectories induced by 
the learner’s policy from trajectories demonstrated by the expert. The 
state-action occupancy induced by the expert’s policy in GAIL is anal- 
ogous to the true data distribution in GANs. Algorithm 20 summarizes 
GAIL. Ho and Ermon [2016] indicated that IRL is a dual of occupancy
measure matching under the maximum entropy principle. Based on this
consideration, the objective function

$$L_{GA} = \mathbb{E}_{\pi^L_\theta}[\ln(D_w(x, u))] + \mathbb{E}_{\pi^E}[\ln(1 - D_w(x, u))] - \lambda H(\pi^L_\theta) \qquad (4.36)$$

is optimized to match the occupancy measure, where $\pi^L_\theta$ is the
learner’s policy parameterized with $\theta$, $D_w$ is the discriminator
network parameterized with $w$, and
$H(\pi^L_\theta) = \mathbb{E}_{\pi^L_\theta}[-\ln \pi^L_\theta(u|x)]$ is the
$\gamma$-discounted causal entropy of the policy $\pi^L_\theta$ [Bloem and
Bambos, 2014]. Through optimizing $L_{GA}$, the discriminator network
$D_w$ and the policy $\pi^L_\theta$ are trained. Here, trust region policy
optimization (TRPO) proposed by Schulman et al. [2015] is used to
optimize $L_{GA}$ with respect to the policy parameter $\theta$. TRPO
constrains the change between the current and updated policies in order
to avoid unstable policy updates. For this purpose, the KL divergence is
used as a measure of the dissimilarity of policies in TRPO.

¹ GAIL [Ho and Ermon, 2016] cannot be fully classified as an IRL approach since
GAIL does not recover the reward function. However, we introduce the study [Ho
and Ermon, 2016] in the IRL section since it is relevant to the concept of IRL.
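
As a rough illustration of how an objective of the form (4.36) is evaluated from samples, the following sketch computes a Monte Carlo estimate from discriminator outputs. It assumes the sign convention of Ho and Ermon [2016], in which both expectation terms are added and the discriminator is trained to output values near 1 on learner samples and near 0 on expert samples. The function names, the toy discriminator outputs, and the choice of passing the policy entropy in as a precomputed number are assumptions of this sketch, not part of the original method description.

```python
import math

def gail_objective(d_learner, d_expert, policy_entropy, lam):
    """Monte Carlo estimate of an objective of the form (4.36).

    d_learner: discriminator outputs D_w(x, u) in (0, 1) on learner samples.
    d_expert:  discriminator outputs D_w(x, u) in (0, 1) on expert samples.
    policy_entropy, lam: entropy estimate and its weight (both assumed given).
    """
    learner_term = sum(math.log(d) for d in d_learner) / len(d_learner)
    expert_term = sum(math.log(1.0 - d) for d in d_expert) / len(d_expert)
    return learner_term + expert_term - lam * policy_entropy

def policy_cost(d):
    """Per-sample surrogate cost ln D_w(x, u) handed to the policy update."""
    return math.log(d)

# A discriminator that cannot tell the two sources apart outputs 0.5 everywhere.
confused = gail_objective([0.5, 0.5], [0.5, 0.5], policy_entropy=0.0, lam=0.0)
# A discriminator that separates the two sources pushes the objective towards 0.
sharp = gail_objective([0.9, 0.9], [0.1, 0.1], policy_entropy=0.0, lam=0.0)
```

Under this convention the discriminator step increases the objective, while the policy step lowers the cost `policy_cost` on its own samples, that is, it moves towards state-action pairs that the discriminator attributes to the expert.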

Recent work by Baram et al. [2017] extended GAIL to a model-based
approach. Baram et al. [2017] proposed to make the computation
for training a stochastic policy fully differentiable by using a forward
model. Their empirical results show that model-based GAIL outperforms
model-free GAIL in continuous control tasks. In addition, the
work by Henderson et al. [2018] extended GAIL to the options framework
for a hierarchical policy.


4.6 Interpretation of IRL with the Maximum Entropy 
Principle 


As we have seen so far, many IRL methods iteratively estimate the 
reward function to make the demonstrations appear more optimal than 
other policies, then update the policy under the updated reward func- 
tion, and execute the policy to get more samples which the reward 
function attempts to distinguish. This process is summarized in Fig- 
ure 4.1. To obtain the unique solution of the “ill-posed” IRL problem, 
the maximum entropy principle is often used. Here, we discuss the in- 
terpretation of IRL with the maximum entropy principle. 

[Figure 4.1 diagram: a loop through "Estimate the reward function R",
"Update the policy", and "Execute the learned policy", drawn between the
data manifold and the policy model manifold.]

Figure 4.1: Illustration of many IRL approaches. Such IRL methods iteratively
estimate the reward function to make the demonstrations appear more optimal
than the current policy, then update the policy under the new reward function, and
execute the policy virtually or physically to get more samples which the reward
function attempts to distinguish.

Let us consider a prior trajectory distribution $p_0(\tau)$ and the
trajectory distribution $p(\tau)$ induced by the learner’s policy.
Information geometry suggests minimizing the KL divergence
$D_{KL}(p(\tau) \| p_0(\tau))$ from $p(\tau)$ to $p_0(\tau)$ [Amari, 2016].
The maximum entropy principle [Jaynes, 1957] suggests choosing the
distribution that maximizes the entropy among the distributions that
achieve at least the same total reward. The entropy $H(p(\tau))$ is defined as

$$H(p(\tau)) = \sum_{\tau} p(\tau) \ln \frac{1}{p(\tau)}, \qquad (4.37)$$

whereas the KL divergence $D_{KL}(p(\tau) \| p_0(\tau))$ is defined as

$$D_{KL}(p(\tau) \| p_0(\tau)) = \sum_{\tau} p(\tau) \ln \frac{p(\tau)}{p_0(\tau)}. \qquad (4.38)$$


Therefore, maximizing the entropy $H(p(\tau))$ is equivalent to minimizing
the KL divergence $D_{KL}(p(\tau) \| p_0(\tau))$ under the assumption that
$p_0(\tau)$ is the uniform distribution. Alternate prior distributions can be
easily taken into account by simply adding a “feature” that is
$\log p_0(\tau)$, either with a weight fixed to 1.0 or allowed to adapt and learn.
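
The equivalence can be checked numerically on a small discrete example. For a uniform prior over $N$ outcomes, $D_{KL}(p \| p_0) = \ln N - H(p)$, so ranking distributions by increasing KL divergence is the same as ranking them by decreasing entropy. The sketch below is only an illustration: it treats a handful of made-up probability vectors as "trajectory distributions" over four outcomes.

```python
import math

def entropy(p):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def kl_divergence(p, q):
    """D_KL(p || q) in nats, assuming q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

n = 4
uniform = [1.0 / n] * n
candidates = [
    [0.7, 0.1, 0.1, 0.1],
    [0.4, 0.3, 0.2, 0.1],
    [0.25, 0.25, 0.25, 0.25],
]
for p in candidates:
    # Identity for a uniform prior: H(p) = ln(n) - D_KL(p || uniform).
    gap = entropy(p) - (math.log(n) - kl_divergence(p, uniform))
    print(p, round(entropy(p), 4), round(kl_divergence(p, uniform), 4), abs(gap) < 1e-12)
```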

The maximum causal entropy distribution [Ziebart et al., 2013] can
be understood as additionally removing the effects of stochastic dynamics.
For learning tasks involving physical systems, it is often desirable
to consider alternate $p_0(\tau)$, particularly by exploiting information
in the system dynamics. For this reason, Dvijotham and Todorov [2010]
proposed to use the trajectory distribution induced by the passive
dynamics $p(x_{t+1} | x_t)$ of the system as the prior $p_0(\tau)$ in the
KL divergence term of the cost function. Kalakrishnan et al. [2013] also
approximated a trajectory distribution using trajectories sampled from
the system dynamics. These methods consider the passive dynamics of the
system in their problem formulation.

The relative entropy IRL approach by Boularias et al. [2011] attempts
to minimize the KL divergence $D_{KL}(p(\tau) \| p_0(\tau))$ with feature
matching constraints. By using importance sampling, the expected feature
counts are approximated without prior knowledge of the system
dynamics. Since the trajectories sampled from the actual system follow
the system dynamics, we can consider that the expected feature
counts approximated using importance sampling implicitly encode the
system dynamics. Arenz et al. [2016] use the M-projection to obtain
the data state distribution analytically, and then use the I-projection
to obtain the policy given the analytic model of the data distribution.
Methods that directly try to minimize the KL divergence to the data
distribution, $D_{KL}(p(\tau) \| q^{demo}(\tau))$, where $q^{demo}(\tau)$ is
the trajectory distribution induced by the expert policy, have not been
widely researched in imitation learning to our knowledge. However, some
recent research shows that any f-divergence can be minimized in GANs
[Nowozin et al., 2016], and given the close connection to IOC methods we
expect that investigations into this area may be profitable.
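
To make the role of importance sampling more concrete, the following sketch estimates the feature-matching gradient from sampled trajectories with self-normalized importance weights. It is a simplified illustration in the spirit of relative entropy IRL, not the exact algorithm of Boularias et al. [2011]: the regularization term is omitted, and the function name, the toy feature counts, and the log-ratio inputs are assumptions made here.

```python
import math

def reirl_gradient(theta, expert_features, sampled_features, log_ratios):
    """Importance-sampled gradient estimate, without any regularization term.

    expert_features:  empirical feature counts of the demonstrations.
    sampled_features: feature counts f(tau_i) of trajectories from a sampling policy.
    log_ratios:       ln(base policy prob. / sampling policy prob.) per trajectory;
                      the dynamics cancel in this ratio, so no model is needed.
    """
    scores = [lr + sum(t * f for t, f in zip(theta, feats))
              for lr, feats in zip(log_ratios, sampled_features)]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    expected = [sum(w * feats[k] for w, feats in zip(weights, sampled_features)) / norm
                for k in range(len(theta))]
    return [e - m for e, m in zip(expert_features, expected)]

samples = [[0.0], [1.0]]   # two sampled trajectories, one feature each (toy data)
expert = [0.75]            # expert feature count (made-up number)
theta = [0.0]
for _ in range(200):       # plain gradient ascent on the reward weights
    grad = reirl_gradient(theta, expert, samples, [0.0, 0.0])
    theta = [t + 0.5 * g for t, g in zip(theta, grad)]
```

In this toy case the weighted feature expectation is a logistic function of the single weight, so the iteration settles where the reweighted samples reproduce the expert feature count.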


4.7 Inverse Reinforcement Learning under Partial Ob- 
servability 


Partial observability is common in robotics and other domains due to 
sensor noise and occlusions caused by objects, robots, humans, and 
the environment. Moreover, the whole process of IRL can be seen as 
a process where the agent has incomplete observations about the true 
reward function. Here, we discuss the cases when the expert and learner
make partial observations, and the case of formally framing IRL as
the learner making partial observations about the reward function.
Section 4.7.1 discusses the case when the learner partially observes
the demonstrations, Section 4.7.2 discusses the case when the expert
makes partial observations when performing demonstrations,
Section 4.7.3 describes how IRL can be framed as a partially observable
Markov decision process, and Section 4.7.4 discusses a model for
optimizing the behavior of both the expert and learner when the reward
function is partially observable.


4.7.1 IRL from Partially Observable Demonstrations 


Recently, inverse reinforcement learning with partially observable ex- 
pert demonstrations has gained interest in vision research [Kitani et al., 
2012] and robotics [Boularias et al., 2012, Bogert and Doshi, 2014, 2015, 
Bogert et al., 2016]. 

Noisy sensors are a common source of partial observability. To forecast
human activities from noisy images, Kitani et al. [2012] extend
maximum entropy IRL [Ziebart et al., 2008] to domains where the
learner only partially observes expert demonstrations. To handle partial
observability, Kitani et al. [2012] propose to use a hidden variable
Markov decision process (hMDP). In an hMDP, observation probabilities
are part of the joint maximum entropy state-observation probability
distribution

$$p(\tau | o, w) = \frac{\exp(w^\top \phi)}{Z(w)}, \qquad (4.39)$$

which is similar to the maximum entropy IRL trajectory probability
distribution in (4.15), but the state features $\phi$ in (4.39) include the
logarithm of the probability of the observations $o$. For simplicity, in
[Kitani et al., 2012], the observation probability is Gaussian.

Boularias et al. [2012] deal with noisy features using a graphical 
model based on Markov random fields (MRFs) that allows correlation 
between actions of similar states. Intuitively, utilizing correlations re- 
duces noise due to the smoothing effect on observations over similar 
states. In many problems state similarity is easy to determine. For ex- 
ample in navigation, Euclidean distance can be used as a similarity 
measure. Boularias et al. [2012] demonstrate the approach in a simulated
navigation task and a simulated grasping task. One disadvantage
of the approach is that the algorithms presented in [Boularias et al.,
2012] are computationally heavy.

Motivated by occlusions in robotic problems, Bogert and Doshi
[2014, 2015] and Bogert et al. [2016] study the problem of reward learning
from partially occluded demonstrations. Moreover, the demonstrations 
are performed by multiple experts. Contrary to [Natarajan et al., 2010], 
the experts’ policies are not independent from each other but take other 
experts into account. The methods developed in [Bogert and Doshi, 
2014, 2015, Bogert et al., 2016] are based on maximum entropy IRL 
[Ziebart et al., 2008]. To handle partial observability, Bogert and Doshi 
[2014] simply do not consider occluded states and actions, but, instead, 
compute feature expectations only for observable states. Bogert and 
Doshi [2014] demonstrate the approach in multi-robot patrolling: the 
learner has to find out the reward functions of patrolling robots in
order to plan a route around them. Bogert and Doshi [2015] also consider
uncertain transition functions. Instead of discarding partially observed
time steps, Bogert et al. [2016] treat missing data as hidden variables
and present an expectation maximization (EM) approach for a locally
optimal solution. Bogert et al. [2016] demonstrate the EM approach in a
simulated reconnaissance scenario with dynamically changing occlusions
and show how a robot learns to perform a sorting task demonstrated by a
human.
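
The simplest of these strategies, computing feature expectations only over the observable part of a demonstration, can be sketched as follows. The data layout (a list of state-action pairs with `None` marking occluded time steps), the toy feature function, and the normalization by the number of visible steps are assumptions of this sketch rather than details taken from Bogert and Doshi [2014].

```python
def observed_feature_expectations(trajectory, features, num_features):
    """Average feature vector over the visible steps of one trajectory.

    trajectory: list of (state, action) pairs, with None for occluded steps.
    features:   hypothetical function mapping (state, action) to a feature list.
    """
    totals = [0.0] * num_features
    visible = 0
    for step in trajectory:
        if step is None:  # occluded step: contributes nothing
            continue
        state, action = step
        for k, value in enumerate(features(state, action)):
            totals[k] += value
        visible += 1
    return [t / visible for t in totals] if visible else totals

def features(state, action):
    # Toy features: indicator of being in cell 0, indicator of moving.
    return [1.0 if state == 0 else 0.0, 1.0 if action == "move" else 0.0]

demo = [(0, "move"), None, None, (3, "stay")]
phi_hat = observed_feature_expectations(demo, features, num_features=2)
```

The EM variant differs in that the `None` entries would be replaced by an expectation over the hidden states and actions instead of being skipped.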


4.7.2 IRL with Incomplete Expert Observations 


Usually the basic premise in IRL is that the expert observes the world 
state fully. However, similarly to the learner, the expert may only 
partially observe the world when demonstrating the task. Thus in- 
stead of an MDP model a partially observable Markov decision process 
(POMDP) model is needed for the expert. The formal POMDP model 
is identical to the MDP model except that a POMDP additionally 
includes observation probabilities conditioned on the next state and 
current action. Policy computation for POMDPs is challenging com- 
pared to MDPs. The same applies to IRL in POMDPs [Choi and Kim, 
2011a]. Choi and Kim [2011a] extend classical IRL algorithms [Ng and 
Russell, 2000, Abbeel and Ng, 2004] to two different POMDP settings: 
1) learning from a given expert’s policy and 2) learning from expert
trajectories. Learning from a given policy is a simpler problem than
learning from trajectories. Because of the computational difficulty, the
demonstrations on benchmark problems are relatively simple.


4.7.3 Active Inverse Reinforcement Learning as a POMDP 


With active inverse reinforcement learning we refer to learning the 
reward function when the robot is able to influence the demonstra- 
tions [Daniel et al., 2015]. An appealing way is to model the process of 
active inverse reinforcement learning as a partially observable Markov 
decision process (POMDP) where the reward function is a hidden 
quantity which the agent partially observes. Solving the POMDP then 
yields optimal actions for both gathering information about the reward 
function and other task specific objectives. Computational methods 
exist for both parametric [Dearden et al., 1999, Poupart and Vlas- 
sis, 2008] and non-parametric [Doshi-Velez et al., 2012, 2015] learning 
of the reward function when the IRL problem itself is modeled as a 
POMDP. The main disadvantage of POMDPs is their high computational
complexity. The current application of POMDPs for active IRL
in robotic applications is limited, but it is an interesting avenue for
future work since POMDPs offer a principled way of modeling IRL. For
example, an optimal POMDP policy resolves the exploration-exploitation
dilemma implicitly, because the value of gathering information is
accounted for through the belief state, which could be a useful property
in active IRL.
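
When the reward function is treated as a hidden quantity, the agent maintains a belief over reward hypotheses and updates it from observations. The sketch below shows one such Bayes update for a finite set of candidate reward functions. The observation model, a Boltzmann-rational demonstrator whose action probabilities are proportional to $\exp(\beta Q)$, is an assumption introduced here for illustration, as are the function name and the toy Q-values; the methods cited above use their own models.

```python
import math

def update_belief(belief, q_values, action, beta=1.0):
    """Bayes update of a belief over candidate reward functions.

    belief:   prior probabilities, one per reward hypothesis.
    q_values: q_values[k][a] is the value of action a under hypothesis k.
    action:   index of the action the demonstrator was observed to take.
    beta:     rationality of the assumed Boltzmann demonstrator model.
    """
    posterior = []
    for prior, q in zip(belief, q_values):
        norm = sum(math.exp(beta * v) for v in q)
        likelihood = math.exp(beta * q[action]) / norm
        posterior.append(prior * likelihood)
    total = sum(posterior)
    return [p / total for p in posterior]

belief = [0.5, 0.5]
q_values = [[1.0, 0.0],   # hypothesis 0 prefers action 0
            [0.0, 1.0]]   # hypothesis 1 prefers action 1
belief = update_belief(belief, q_values, action=0)
```

A POMDP solver would plan over such beliefs, choosing actions (for example, queries to the demonstrator) according to how much they are expected to sharpen the belief and how much task reward they yield.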


4.7.4 Cooperative Inverse Reinforcement Learning 


In the vein of the approaches discussed above, Hadfield-Menell et al.
[2016] frame IRL as the problem of learning a hidden reward function
in a partially observable Markov decision process (POMDP).
Hadfield-Menell et al. [2016] define and study the cooperative inverse
reinforcement learning (CIRL) problem. CIRL is a two-player game
in which the human observes the reward function but the robot does not.
Traditional IRL [Ng and Russell, 2000] assumes that the demonstrator
acts according to an optimal policy. Hadfield-Menell et al. [2016] show
that in CIRL, the human may accept a sub-optimal reward if doing so
provides the robot with more information. CIRL defines optimal
behavior for both the human and the robot when optimizing reward for
the human. CIRL potentially leads to policies where human teaching
and robot learning are jointly optimized. Hadfield-Menell et al. [2016]
show that finding optimal policies for the human and robot in CIRL
corresponds to solving a POMDP. A drawback of the POMDP model
is that, in practice, exact optimal solutions are hard to come by, but
the model can be used as a theoretical tool and a basis for practical
solutions.

Hadfield-Menell et al. [2016] demonstrate the CIRL framework in
simple simulated scenarios. In more complicated robotic experiments,
performing close-to-optimal demonstrations, as in traditional IRL,
could be easier for a human than teaching a robot optimally. To perform
demonstrations which teach the robot optimally, the human has to
consider how the robot optimizes learning in addition to the actual
task being demonstrated.


4.8 Robot Applications with Inverse Reinforcement 
Learning Methods 


Inverse reinforcement learning has been used for tasks such as parsing
sentences [Neu and Szepesvari, 2009], car driving [Abbeel and Ng, 2004],
path planning [Ratliff et al., 2006b, Silver et al., 2010, Zucker et al.,
2011], and robot motions [Boularias et al., 2011, Finn et al., 2016b].
First, we review applications of model-based inverse reinforcement
learning methods. Since model-based IRL methods assume that the
dynamics of the system are available, they have been applied to problems
where the system dynamics are completely known, such as a driving
simulator. Thereafter, we review applications of model-free inverse
reinforcement learning methods. Since model-free IRL methods do not
require prior knowledge of the system dynamics, they can be applied to
robotic tasks where the dynamics of a manipulator are hard to obtain.


4.8.1 Learning to Drive a Car in a Simulator 


Simulated car driving is a typical application which can be modeled
as an MDP problem. It is often assumed that the policy is stationary
(independent of time) and that the state-action space can be approximated
by a set of discrete states and actions. Abbeel and Ng [2004]
demonstrated the performance of IRL in the car-driving simulation shown
in Figure 4.2. In the car simulation, five actions
were available, three of which were to steer the car to one of the lanes,
and two of which were to drive off the road on the left or the right side.
The expert’s features were computed from a single trajectory of 1200
samples. In this experiment, different driving styles were demonstrated
by the expert. The results show that the method in [Abbeel and Ng,
2004] is able to imitate different driving styles.

Figure 4.2: Screenshot of the driving simulator used in [Abbeel and Ng, 2004]. A
time-invariant policy was learned using a model-based IRL method. Experimental
results show that a different driving style can be learned using different
demonstration data.
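
Computing the expert's features from a single trajectory amounts to accumulating discounted feature vectors along that trajectory. The sketch below shows this computation and checks that, for a reward that is linear in the features, the discounted return equals the inner product of the weights with the accumulated features. The three-step trajectory, the two features, and the discount factor are made up for illustration and are not the simulator's actual features.

```python
def discounted_feature_expectation(feature_sequence, gamma):
    """mu = sum_t gamma^t * phi(s_t), accumulated along one trajectory."""
    mu = [0.0] * len(feature_sequence[0])
    discount = 1.0
    for phi in feature_sequence:
        for k, value in enumerate(phi):
            mu[k] += discount * value
        discount *= gamma
    return mu

def linear_return(weights, feature_sequence, gamma):
    """Discounted return when the reward is R(s) = w . phi(s)."""
    return sum((gamma ** t) * sum(w * v for w, v in zip(weights, phi))
               for t, phi in enumerate(feature_sequence))

# Toy trajectory with two indicator features per time step.
trajectory = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
mu = discounted_feature_expectation(trajectory, gamma=0.5)
weights = [2.0, -1.0]
value_from_mu = sum(w * m for w, m in zip(weights, mu))
```

Because the return depends on the trajectory only through `mu`, a learner whose feature expectations match those of the expert attains the same return under any such linear reward, which is the observation that feature-matching IRL methods build on.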


4.8.2 Learning Path Planning with MMP 


Ratliff et al. [2006b] and Silver et al. [2010] apply maximum margin
planning (MMP) and LEARCH for finding a path with minimum accumulated
cost (see Figure 4.3). Interestingly, from raw perceptual data,
lattice planners can be taught human-like rough terrain driving more
efficiently than by manually programming the behavior [Silver et al.,
2010]. LEARCH learns the cost as a function of features, and the optimal
path can be found by using classic motion planning methods on
the recovered cost function. The features of the MDP are based on
visual (images/lidar) input as shown in Figure 4.4. The learned cost
function reproduces paths incrementally more similar to the example
path as the learning evolves. MMP and LEARCH have been applied to
various robotic systems, including footstep planning for a quadruped
robot [Zucker et al., 2011].

Figure 4.3: The learning to search (LEARCH) approach for identifying a cost
function has been applied to various robotic applications including learning rough
terrain navigation from sensor data. The approach iterates between building a
discriminative classifier between states visited by the learner and the demonstrator,
updating the cost function with the discriminative classifier, and then using
classical path planning methods to identify a new proposed optimal plan.

Figure 4.4: Examples of path planning with LEARCH [Silver et al., 2010]. Top
figures show the satellite images and the bottom figures show the costs. The cost
function evolves from left to right in the learning process. The red line represents the
example path and the green represents the current plan. The learned cost function
reproduces paths more similar to the example path as the learning evolves. The
upper set of images shows the raw visual (camera) data being interpreted by the
learner, the lower images show the interpretation in terms of costs (white expensive,
dark low-cost).

Figure 4.5: Learning house-keeping tasks in [Finn et al., 2016b]. Tasks that require
a nonlinear reward function and a complex policy were learned using guided cost
learning.


4.8.3 Learning Motion Planning with Deep Guided-Cost 
Learning 


Learning manipulation tasks often requires nonlinear reward functions. 
Finn et al. [2016b] applied guided cost learning to house-keeping tasks 
such as moving dishes and pouring water shown in Figure 4.5. Demon- 
strations were recorded using kinesthetic teaching with a PR2 robot. 
As we described in §4.4.6, guided cost learning uses a neural network 
to represent the reward function. The state of the system was rep- 
resented by vision-based features obtained by using an unsupervised 
learning method [Finn et al., 2016b]. The experimental results show 
that guided cost learning can be used to learn robotic manipulation 
tasks that require a nonlinear reward function under unknown dynamics.

Figure 4.6: Learning ball-in-the-cup in [Boularias et al., 2011]. The KL divergence
between the expert policy and the learner’s policy was minimized using a sampling- 
based method. 


4.8.4 Learning a Ball-in-a-Cup task with Relative Entropy 
Inverse Reinforcement Learning 


Learning robotic tasks with an underactuated manipulator is non- 
trivial because the dynamics of the system is hard to estimate. Since 
model-based IRL methods require an accurate model of the system 
dynamics, applying model-based IRL methods to such tasks can be 
challenging. Boularias et al. [2011] applied the model-free Relative
Entropy Inverse Reinforcement Learning (RE-IRL) approach to the
ball-in-a-cup task with the underactuated robot shown in Figure 4.6. A
human demonstrated the ball-in-a-cup motion 17 times, and the motions
were recorded using a 3D motion capture system. Robotic simulations
showed successful learning of the demonstrated motion.


5 


Challenges in Imitation Learning for 
Robotics 


We have surveyed the state of the art in imitation learning for robotics. 
Although imitation learning has progressed rapidly, it is clear that there 
are still many problems and challenges which need to be investigated. 
In this section, we highlight open questions and technical challenges in 
imitation learning. 


5.1 Behavioral Cloning vs Inverse Reinforcement Learn- 
ing 


Behavioral cloning (BC) and inverse reinforcement learning (IRL) 
methods form the two major classes of imitation learning methods. 
As discussed in § 2, “BC vs IRL” is the first question that one needs 
to answer when applying imitation learning to the problem at hand. 
Recovering the reward function can be interpreted as inferring the 
expert’s intent since the reward function encodes the objective for the 
desired task. For example, when learning from a sequence of images 
without kinematic information of the expert, it is not clear how to ap- 
ply behavioral cloning. In such a case, we need to infer what is desired 
by the expert and then estimate a policy to achieve the inferred goal.
To address the problem of imitation from observation, the
recent work by Sermanet et al. [2017] and Liu et al. [2017] proposed
methods for recovering the reward function from visual features
extracted by deep neural networks. Thus, IRL is a reasonable choice for
such problems where inference of the expert’s intent is necessary, even
if the policy itself is more compact than a reward function.

When both behavioral cloning and inverse reinforcement learning can
be applied to a given problem, it is essential to consider “what is the
most parsimonious description of the desired behavior, reward or
policy?”. Ho and Ermon [2016] recently indicated that under the maximum
entropy assumption recovering the reward function is the dual
of matching the expectation of states and actions. This implies that 
BC and IRL can be equivalent under certain assumptions since BC 
methods learn a policy by matching the expectation of states and ac- 
tions and IRL methods learn a policy based on the reward function 
recovered by matching the expectation of states and actions. Since IRL 
recovers the “hidden” reward function, IRL often adds complexity to 
the solution approach compared to BC. Thus, in order to select BC or 
IRL, it is essential to clarify whether recovering the reward function is 
beneficial or not. 

For instance, recovering a reward function for a manipulation task 
is often difficult since it is not trivial to extract features of the given 
scene which are relevant to the task. On the other hand, the distri- 
bution of the demonstrated trajectories for manipulation can be often 
learned without recovering the reward function. When the distribution 
of necessary trajectories can be predicted for a given context, the task 
can be performed without any knowledge about the reward function of 
the task. In this case, the distribution of the demonstrated trajectories 
can be considered a parsimonious description of the desired behavior. 

As another example, learning a reward function for footstep plan- 
ning for a quadruped robot enables generalizing the footstep planning 
strategy to different terrains. If the reward function that tells “which
footstep location is stable” is recovered, footstep locations can be
adaptively selected based on this criterion. Such generalization is hard
to obtain if we only learn the distribution of the footstep locations. In
this case, the reward function is considered a parsimonious description of
the desired behavior, which enables good generalization of skills.

Overall, the answer to the question “BC vs IRL” depends entirely
on the problem setting. It is essential to analyze what task should be
performed, and how, when applying imitation learning methods.


5.2 Open Questions in Imitation Learning 


We have discussed the state of the art in imitation learning in this 
survey. Although imitation learning methods so far have demonstrated 
great capability, it is clear that there still exist several challenges to be
solved. In this section, we highlight open questions in imitation learning 
and try to clarify what problems need to be solved. 


5.2.1 Problems Related to Demonstrated Data 


The first step of imitation learning is to collect expert demonstration 
data. However, it is often not trivial to obtain appropriate data to 
achieve satisfactory performance in imitation learning. Below we list 
questions related to data collection. 


How to learn from multiple experts? It is known that imitation
learning methods work better for demonstrations performed by a single
expert than for demonstrations performed by multiple experts [Camacho
and Michie, 1995]. Therefore, when multiple human experts give
instructions to a robotic system, one could select the demonstrations of
a single expert from among them. However, this problem has not been
sufficiently addressed.


How to deal with undesirable motions in demonstrations? 
Many imitation learning methods assume that demonstrated behavior 
is (sub-)optimal. However, in practice, demonstrated behavior often 
contains undesirable motions which may result in low-performance
policies. To address this issue, reinforcement learning can be used to 
improve the learned policy [Kober et al., 2013, Mnih et al., 2015, Silver 
et al., 2016]. Nevertheless, explicitly detecting unnecessary motion and 
removing it from demonstrated behavior is still an open problem.


How to learn from raw sensory inputs without embodiment
information? When learning only from vision we cannot directly 
measure the kinematic information of the expert. While learning from 
raw sensory inputs without embodiment information is challenging, 
humans can do it based on prior knowledge. Recent work by Sermanet 
et al. [2017] shows that the reward function can be inferred from few 
demonstrations by using visual representations learned by deep models. 


How to deal with different viewpoints? Current imitation learn- 
ing methods are usually limited to the case where the demonstration 
is supplied in the first-person, i.e., a sequence of states and actions is 
provided similarly to how the learner would observe the task. However, 
humans can learn by observing the behavior of other humans. When 
learning from the third-person view it is necessary to infer how the 
task should be performed. Recent work on third-person imitation 
learning [Stadie et al., 2017] addresses this problem in some simple 
environments. 


How to leverage past demonstrations of other related tasks
to learn the current task more quickly? While it is challenging to
learn a very complex task from one demonstration, humans can learn 
from few demonstrations because they have so much prior knowledge. 
In principle, this knowledge could be captured and reused for other 
tasks. Recent work such as [Gupta et al., 2017, Finn et al., 2017a,b, 
Duan et al., 2017] addresses this research direction. 


5.2.2 Open Questions Related to Design Choices 


When we implement imitation learning in an actual robotic system, 
we need to make several design choices as we discussed in Chapter 2. 
There are still several open questions when making such design choices. 


What is the best similarity measure of policies? To obtain a 
policy that imitates experts’ behavior, it is essential to measure the
similarity of policies. Although we discussed some similarity measures
such as KL divergence and Euclidean distance, there exist many
other options. For example, recently the Wasserstein distance (also known
as the earth mover's distance) [Arjovsky et al., 2017] has been shown to
improve the performance of generative adversarial networks (GANs) 
[Goodfellow et al., 2014] which have inspired some recent imitation 
learning approaches [Ho and Ermon, 2016, Finn et al., 2016a]. Ex- 
ploring new similarity measures is a promising way to discover new 
imitation learning methods which may work in situations not handled 
by current methods. 
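
One reason the choice of similarity measure matters can be seen on a toy example. When two distributions have disjoint support, the KL divergence is infinite no matter how far apart they are, whereas the Wasserstein distance still reflects how far probability mass has to be moved. The sketch below compares the two on one-dimensional histograms over unit-spaced bins; interpreting the histograms as action distributions of an expert and of two candidate learners is an assumption made purely for illustration.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats; infinite if q is zero where p is not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi == 0.0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

def wasserstein_1d(p, q):
    """Earth mover's distance between histograms on bins 0, 1, 2, ... (unit spacing)."""
    distance, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi        # running difference of the two CDFs
        distance += abs(cdf_gap)
    return distance

expert = [1.0, 0.0, 0.0, 0.0]
near = [0.0, 1.0, 0.0, 0.0]   # learner one bin away from the expert
far = [0.0, 0.0, 0.0, 1.0]    # learner three bins away from the expert

kl_near, kl_far = kl_divergence(near, expert), kl_divergence(far, expert)
w_near, w_far = wasserstein_1d(near, expert), wasserstein_1d(far, expert)
```

Here the KL divergence cannot distinguish the nearby learner from the distant one, while the Wasserstein distance orders them as expected, which gives a learning signal even before the supports overlap.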


How to learn from multiple instruction types? In practice, 
various types of instructions are available, such as corrective motion 
from operators, preferences on optional actions and evaluation of 
the performance. To achieve intuitive human-robot interaction and 
efficient learning, it is necessary to utilize various instruction types. 
Although some methods incorporate multiple instruction types [Jain
et al., 2015], this research direction has not been well-investigated yet.


How to incorporate prior knowledge? How to do it explicitly?
Although prior knowledge of the system or environment, e.g., the
kinematics and the mass of a manipulator, is often available, many
imitation learning methods utilize only demonstrations. However,
incorporating available prior knowledge will be useful for system
control and trajectory planning. On the other hand, many methods
use implicit prior knowledge such as assuming a Gaussian distribution 
of samples. Methods that explicitly incorporate prior knowledge could 
lower the amount of demonstration data required and make new 
robotic applications possible. 


How to learn from various sensors? Many studies on imitation 
learning implicitly select sensory information appropriate for their 
method. However, in practice, we can use various redundant sen- 
sory information such as tactile information, RGB-D images, audio 
information, and encoders in robot joints. Fusing various sensory
information will lead to more robust and adaptive behavior.


How to learn tasks humans cannot do? Imitation learning 
methods assume that demonstrations of the desired task are available. 
However, it is often the case that human operators cannot appropri- 
ately demonstrate the given task, especially in cases where a robot has 
a physical advantage compared to a human. For example, a robotic 
system may have more than two arms making it challenging for the 
human operator to demonstrate the desired behavior. To achieve 
performance beyond human capability, methods that iteratively 
improve the performance of the system will be necessary. 


How to choose a trajectory representation? In §3.5.1, we dis- 
cussed several different trajectory representations. An interesting open 
question is how to choose among the trajectory representations. We 
gave in §3.5.2 some suggestions on how to choose based on the different
properties of the representations. However, there is no definite answer
on how to select a trajectory representation. Note that choosing a
trajectory representation is analogous to model selection in machine 
learning [Bishop, 2006]. Considering trajectory representation selection 
as a model selection problem could lead to interesting advances. 


5.2.3 Problems Related to Algorithms 


When we want to overcome limitations in current imitation learning, 
we also need to face several open questions related to algorithmic 
aspects of imitation learning. 


How to generalize skills with complex conditions? Many
methods model the distribution over demonstrated trajectories and
generalize the skill by conditioning the distribution, for example on
different start or end positions [Khansari-Zadeh and Billard, 2011,
Paraschos et al., 2013]. However, such methods might not scale to
high-dimensional conditions. Although some work addresses scaling up
generalization of skills with high-dimensional inputs [Schulman et al.,
2013], further investigation is necessary. Recent work by Finn et al.
[2017b], Sermanet et al. [2017], Liu et al. [2017], and Rahmatizadeh
et al. [2017] proposed methods for learning from visual information
using deep neural networks, which is a promising way to address
skill generalization with complex conditions.


How to find solutions with guarantees? Some imitation learning
methods come with performance guarantees, e.g., stability of
DMPs [Ijspeert et al., 2002a, Schaal et al., 2004] and a proof of low
error in DAGGER [Ross et al., 2011]. However, for many imitation
learning methods there are currently no performance guarantees.
Especially in robotics, guarantees such as stability or convergence 
can be very important in practice. Finding guarantees for common 
imitation learning methods is a worthy research direction. 


How to scale up with respect to the number of dimensions? 
Motion planning in a robotic system requires a high dimensional 
solution. For example, a humanoid robot often has over 50 joints. 
However, existing imitation learning methods are often inefficient for 
such high dimensional motion due to the different embodiment of the 
learner and the expert. Recent studies show that the dimensionality of 
the input space can be scaled up using convolutional neural networks. 
However, current methods for high dimensional inputs are often limited 
to 2D images. Incorporating high dimensional sensory inputs is still 
an open question. In addition, scaling up the dimensionality of actions 
is also an open problem. Incorporating dimensionality reduction in 
imitation learning is an interesting research direction [Sugiyama et al., 
2010, Tangkaratt et al., 2015]. 


How to find globally optimal solutions in high dimensional 
spaces? How to make it tractable? In robotic applications, it is 
essential to find solutions in a continuous and high dimensional space. 
Many imitation learning methods find locally optimal solutions close 
to the behavior demonstrated by experts. However, there may exist a 
better solution which is different from the demonstrated behavior. 


How to perform imitation by multiple agents? In multi-agent 
domains, an agent needs to consider how the other agents’ behavior 
may influence the outcome. Prior work [Waugh et al., 2011, Kuleshov 
and Schrijvers, 2015] addresses how to infer the reward function, 
which represents the equilibrium of agents’ strategies, from observed 
behavior of multiple agents. However, the results are still quite limited 
to simple problem settings and have not migrated to large scale robot 
applications. 


How to perform incremental/active learning in IRL? 
Although many inverse reinforcement learning (IRL) methods assume 
a sufficient number of demonstrations, this is often not the case in 
practice. When the policy learned from the initial dataset of 
demonstrations does not show satisfactory performance, the policy can 
be incrementally improved. Silver et al. [2012] and Lopes et al. [2009] 
proposed methods for IRL with active learning. However, such incremental 
IRL methods have not been investigated sufficiently. 


5.2.4 Performance Evaluation 


Since the purpose and target applications of imitation learning are very 
broad, benchmarking imitation learning methods can be challenging. 
The following open questions are related to performance evaluation in 
imitation learning. 


How to establish benchmark problems for imitation learning? 
Unlike other machine learning fields, there is no widely accepted 
set of benchmark problems for imitation learning. Although efforts 
for benchmarking different techniques have been made, e.g., [Lemme 
et al., 2015], there is no clear way to compare performance between 
methods. Benchmark problems like those used in the data mining and 
computer vision communities should be established. 


What metric should be used to evaluate imitation learning 
methods? There are various ways to quantify imitation learning 
performance. However, there is no established way to evaluate imitation 
learning methods, nor are there yet large scale benchmarks that make 
it effective and easy to compare and contrast approaches. 
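
One of the simplest such quantities, shown here only as an illustrative 
assumption and not as a recommended standard, is the mean squared error 
between a demonstrated and a reproduced trajectory sampled at the same 
time steps. The function name and the one-dimensional trajectory format 
are hypothetical. 

```python
def trajectory_mse(demonstrated, reproduced):
    """Mean squared error between two equally sampled 1-D trajectories."""
    if len(demonstrated) != len(reproduced):
        raise ValueError("trajectories must have the same number of samples")
    if not demonstrated:
        raise ValueError("trajectories must not be empty")
    squared_errors = [(d - r) ** 2 for d, r in zip(demonstrated, reproduced)]
    return sum(squared_errors) / len(squared_errors)


# Toy example: the reproduction deviates only at the final sample.
error = trajectory_mse([0.0, 1.0, 2.0], [0.0, 1.0, 4.0])
```

A measure of this kind rewards copying the demonstration closely and 
says nothing about whether the task itself succeeded, which illustrates 
why the choice of metric remains open. 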